Video Processing

Overview

This skill provides guidance for video processing tasks involving frame-level analysis, event detection, and motion tracking using computer vision libraries like OpenCV. It emphasizes verification-first approaches and guards against common pitfalls in video analysis workflows.

Core Approach: Verify Before Implementing

Before writing detection algorithms, establish ground truth understanding of the video content:

  1. Extract and inspect sample frames - Save key frames as images to visually verify what is happening at specific frame numbers
  2. Understand video metadata - Frame count, FPS, duration, resolution
  3. Map expected events to frame ranges - If test data exists, understand what frames correspond to which events
  4. Build diagnostic tools first - Frame extraction and visualization utilities provide critical insight

Workflow for Event Detection Tasks

Phase 1: Video Exploration

# Essential first steps for any video analysis task
import cv2

cap = cv2.VideoCapture(video_path)
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
duration = frame_count / fps
cap.release()

print(f"Frames: {frame_count}, FPS: {fps}, Duration: {duration:.2f}s, Resolution: {width}x{height}")

Critical: Extract frames at expected event locations to verify understanding:

def save_frame(video_path, frame_num, output_path):
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_num)
    ret, frame = cap.read()
    if ret:
        cv2.imwrite(output_path, frame)
    cap.release()

# Save frames at expected event times for visual inspection
save_frame("video.mp4", 50, "frame_050.png")
save_frame("video.mp4", 60, "frame_060.png")

Phase 2: Algorithm Development

When developing detection algorithms:

  1. Start simple - Basic frame differencing or thresholding before complex approaches
  2. Use configurable thresholds - Avoid hardcoded magic numbers; derive from data
  3. Test on known frames first - Verify algorithm produces expected results on frames with known ground truth
  4. Log intermediate values - Track metrics at each frame to understand algorithm behavior (points 2 and 4 are sketched below)
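
A minimal sketch of points 2 and 4, assuming a simple frame-differencing detector; diff_threshold is an illustrative default here, not a calibrated value:

import cv2

def log_motion_metrics(video_path, diff_threshold=25):
    # Log per-frame difference metrics against the first frame as reference
    cap = cv2.VideoCapture(video_path)
    ret, first = cap.read()
    ref = cv2.GaussianBlur(cv2.cvtColor(first, cv2.COLOR_BGR2GRAY), (21, 21), 0)
    frame_num, mean_diffs = 0, []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_num += 1
        gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
        diff = cv2.absdiff(ref, gray)
        _, thresh = cv2.threshold(diff, diff_threshold, 255, cv2.THRESH_BINARY)
        mean_diffs.append(diff.mean())
        # Logging intermediate values lets threshold choices be justified from data
        print(f"frame {frame_num}: mean_diff={diff.mean():.2f}, motion_pixels={int((thresh > 0).sum())}")
    cap.release()
    return mean_diffs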

Phase 3: Validation

Before finalizing:

  1. Sanity check outputs - Do detected events occur in reasonable order and timing?
  2. Test on multiple videos - Verify generalization across different inputs
  3. Compare against expected ranges - If ground truth exists, verify detection accuracy

Common Detection Approaches

Frame Differencing

Compares frames against a reference (first frame or previous frame) to detect motion:

# Background subtraction against the first frame as a reference
cap = cv2.VideoCapture(video_path)
ret, first_frame = cap.read()
first_gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
first_gray = cv2.GaussianBlur(first_gray, (21, 21), 0)

# For each subsequent frame, measure deviation from the reference
while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)
    diff = cv2.absdiff(first_gray, gray)
cap.release()

Pitfall: First frame may not be a suitable reference if scene changes or camera moves.

Contour-Based Detection

Identifies objects by finding contours in thresholded images:

_, thresh = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

Pitfall: Threshold values (e.g., 25) and minimum contour areas are arbitrary without calibration.
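
A follow-up sketch that filters contours by area and extracts a centroid via image moments; min_area here is an assumed placeholder to be calibrated against real data:

min_area = 500  # assumed placeholder; calibrate against real data
for contour in contours:
    area = cv2.contourArea(contour)
    if area < min_area:
        continue
    M = cv2.moments(contour)
    if M["m00"] > 0:  # guard against degenerate contours (zero area)
        cx = int(M["m10"] / M["m00"])
        cy = int(M["m01"] / M["m00"])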

Tracking Position Over Time

For detecting events like jumps or gestures, track object position across frames:

positions = []  # (frame_num, cx, cy, area) tuples
for frame_num in range(frame_count):
    # ... detection code ...
    if detected:
        positions.append((frame_num, cx, cy, area))

Pitfall: Coordinate systems matter. In image coordinates, Y increases downward, so "higher in frame" means smaller Y values.
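
For example, a minimal sketch of locating a jump apex from the tracked positions above: the apex is the frame with the smallest cy, because image Y grows downward:

# The jump apex is where the centroid is highest in the frame,
# i.e. where cy is SMALLEST (image Y increases downward)
apex_frame, _, apex_y, _ = min(positions, key=lambda p: p[2])
print(f"Apex at frame {apex_frame} (y={apex_y})")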

Verification Strategies

1. Visual Inspection

Save frames at detected event times to verify correctness:

# After detecting takeoff at frame N
save_frame(video_path, detected_takeoff, "detected_takeoff.png")
save_frame(video_path, detected_takeoff - 5, "before_takeoff.png")
save_frame(video_path, detected_takeoff + 5, "after_takeoff.png")

2. Timing Reasonableness

Check if detected events make temporal sense:

duration_seconds = frame_count / fps
event_time = detected_frame / fps

# Example: A jump in a 4-second video shouldn't be detected in the last 0.5 seconds
if event_time > duration_seconds - 0.5:
    print("WARNING: Event detected very late in video - verify correctness")

3. Sequence Validation

Ensure events occur in logical order:

if detected_landing <= detected_takeoff:
    print("ERROR: Landing cannot occur before or at takeoff")

4. Multi-Video Testing

Test on multiple inputs early to catch overfitting to single video characteristics.
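
A minimal sketch, assuming a hypothetical detect_events entry point that wraps the detection pipeline:

# Run the same pipeline over several clips to surface single-video overfitting early
for path in ["clip_a.mp4", "clip_b.mp4", "clip_c.mp4"]:
    events = detect_events(path)  # detect_events is hypothetical, not an OpenCV call
    print(f"{path}: {events}")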

Common Pitfalls

1. No Ground Truth Verification

Problem: Relying entirely on computed metrics without visual confirmation.

Solution: Always save and inspect frames at detected event locations.

2. Confirmation Bias in Data Interpretation

Problem: When data shows unexpected patterns, inventing explanations that fit preconceptions rather than questioning assumptions.

Solution: When detection results seem wrong, investigate root causes rather than rationalizing unexpected behavior.

3. Magic Number Thresholds

Problem: Using arbitrary thresholds (500 for contour area, 25 for binary threshold) without empirical basis.

Solution: Derive thresholds from actual video data or make them configurable with sensible defaults.
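
One hedged sketch of deriving a threshold from data: collect per-frame mean differences (e.g., from the logging sketch in Phase 2) and set the cutoff a few standard deviations above the typical level. The factor k=3.0 is an assumed default, not a calibrated constant:

import numpy as np

def derive_threshold(mean_diffs, k=3.0):
    # Cutoff = mean + k * std of per-frame mean differences
    arr = np.asarray(mean_diffs, dtype=float)
    return float(arr.mean() + k * arr.std())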

4. Ignoring Detection Gaps

Problem: When detection fails for a range of frames, assuming this is expected behavior without investigation.

Solution: Investigate why detection fails - it may indicate algorithm flaws rather than expected behavior.
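
A small sketch that surfaces gaps explicitly, assuming the positions list from the tracking section above:

# Report runs of consecutive frames with no detection so gaps get
# investigated rather than silently accepted
detected_frames = sorted(p[0] for p in positions)
for prev, curr in zip(detected_frames, detected_frames[1:]):
    if curr - prev > 1:
        print(f"Detection gap: frames {prev + 1}-{curr - 1}")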

5. Coordinate System Confusion

Problem: Misinterpreting Y coordinates (smaller Y = higher in frame in image coordinates).

Solution: Explicitly document coordinate system assumptions and verify with visual inspection.

6. Ignoring Timing Reasonableness

Problem: Accepting detections that don't make temporal sense (e.g., event detected in last 0.8 seconds of a 4-second video).

Solution: Implement sanity checks on output timing.

7. Single Video Overfitting

Problem: Algorithm works on one video but fails on others.

Solution: Test on multiple videos early in development.

Output Format Considerations

When outputting results (e.g., to TOML, JSON):

import numpy as np

# Convert numpy types to Python native types for serialization
result = {
    "takeoff_frame": int(takeoff_frame),  # Not np.int64
    "landing_frame": int(landing_frame),
}
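
Without the cast, json.dumps(result) raises a TypeError, since numpy scalar types are not JSON serializable by default.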

Debugging Checklist

When detection results are incorrect:

  1. Have I visually inspected frames at the expected event times?
  2. Have I visually inspected frames at my detected event times?
  3. Do my detected times make temporal sense given video duration?
  4. Have I verified my algorithm on frames with known ground truth?
  5. Am I correctly interpreting the coordinate system?
  6. Have I tested on multiple videos?
  7. Are my thresholds derived from data or arbitrary?
  8. When detection fails on some frames, do I understand why?