Video Frames

Extract frames from video files using ffmpeg, producing JPEG images optimized for LLM vision analysis. Supports multiple frame-selection strategies (fixed FPS, scene detection, target frame count), quality presets, model-aware dimension optimization, and OCR enhancements.

Prerequisites

ffmpeg and ffprobe must be installed and on PATH:

brew install ffmpeg  # macOS

Workflow

  1. Receive a video file path from the user
  2. Run scripts/extract_frames.py to extract JPEG frames
  3. Parse the JSON output for frame paths, resolution, and token estimates
  4. Read the extracted frames as image attachments for analysis
  5. Answer the user's question about the video content
  6. Clean up temp directories when done
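
A minimal sketch of steps 2-3 in Python (the video path and flags are illustrative; the field names come from the Output JSON Structure section below):

import json
import subprocess

# Step 2: run the extractor. It prints JSON to stdout; on failure it prints
# an {"error": ...} object to stderr and exits 1, so check=True will raise.
result = subprocess.run(
    ["python3", "scripts/extract_frames.py", "video.mp4", "--max-frames", "30"],
    capture_output=True, text=True, check=True,
)

# Step 3: pull out frame paths, resolution, and token estimates.
data = json.loads(result.stdout)
frames = data["frames"]
print(data["resolution"], data["token_estimate"]["total"])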

Quick Start

The simplest invocation extracts 1 frame/second at balanced quality:

python3 scripts/extract_frames.py video.mp4

For most use cases, use --max-frames to let the script auto-calculate FPS:

python3 scripts/extract_frames.py video.mp4 --max-frames 30

This is the preferred approach over manually setting --fps, since it adapts to any video length and keeps the frame count predictable.
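
The arithmetic behind --max-frames is simple; a sketch of the likely calculation (the 0.05-30.0 clamp is documented under Advanced Options, and the script's exact rounding may differ):

def auto_fps(duration_seconds: float, max_frames: int) -> float:
    """Pick an FPS that yields roughly max_frames frames,
    clamped to the documented 0.05-30.0 range."""
    return min(30.0, max(0.05, max_frames / duration_seconds))

print(auto_fps(120.5, 30))  # ~0.249 fps -> ~30 frames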

Presets

Four quality presets control resolution, JPEG quality, and image processing:

Preset      Max dim   Quality   Extras                                Best for
---------   -------   -------   -----------------------------------   ---------------------------
efficient   768px     5         --                                    Bulk frames, long videos
balanced    1024px    3         --                                    General analysis (default)
detailed    1568px    2         --                                    Fine detail, small objects
ocr         1568px    1         grayscale + high contrast + sharpen   Text/document extraction

# Long video, keep costs low
python3 scripts/extract_frames.py long_video.mp4 --max-frames 20 --preset efficient

# Need to read text in a screencast
python3 scripts/extract_frames.py screencast.mp4 --max-frames 40 --preset ocr

Quality (1=best, 31=worst) and max dimension can be overridden independently:

python3 scripts/extract_frames.py video.mp4 --preset balanced --quality 1 --max-dimension 1568

Scene-Change Detection

Instead of extracting at a fixed rate, detect visual scene changes and extract one frame per scene. This is ideal for videos with distinct segments (presentations, edited footage, tutorials).

python3 scripts/extract_frames.py video.mp4 --scene-threshold 0.3

  • --scene-threshold (float, 0.0-1.0): Sensitivity. Lower values are more sensitive and detect smaller changes. Start with 0.3 (the default when the flag is used).
  • --min-scene-interval (float, default 1.0): Minimum seconds between detected scenes; prevents bursts of detections during rapid cuts.

Note: --fps and --scene-threshold are mutually exclusive. --max-frames can only be used with --fps mode, not scene detection.

# Presentation with clear slide transitions
python3 scripts/extract_frames.py presentation.mp4 --scene-threshold 0.2

# Action footage -- less sensitive, min 2s apart
python3 scripts/extract_frames.py action.mp4 --scene-threshold 0.5 --min-scene-interval 2.0
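
For intuition: ffmpeg exposes a per-frame scene-change score through its select filter, so detection likely builds on an expression of this kind. A hedged sketch (the script's actual logic, including --min-scene-interval enforcement, may differ):

threshold = 0.3
# Keep only frames whose scene score exceeds the threshold; the inner quotes
# protect the comma from ffmpeg's filtergraph parser.
vf = f"select='gt(scene,{threshold})'"
# e.g. ffmpeg -i video.mp4 -vf "select='gt(scene,0.3)'" -vsync vfr frame_%05d.jpg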

Model-Aware Optimization

Use --target-model to resize frames to dimensions that align with a specific model's tile boundaries, minimizing wasted tokens:

Model       Max dim   Rationale
---------   -------   ----------------------------------------------
claude      1568px    Max native resolution before auto-resize
openai      768px     Aligned to 512px tile grid (shortest side 768)
gemini      768px     Aligned to 768px tile boundaries
universal   768px     Sweet spot across all models (default)

# Optimized for Claude -- maximum detail
python3 scripts/extract_frames.py video.mp4 --max-frames 30 --target-model claude

# Optimized for GPT-4o -- efficient tile packing
python3 scripts/extract_frames.py video.mp4 --max-frames 30 --target-model openai

--target-model sets the max dimension unless --max-dimension is explicitly provided (CLI override takes priority).

See references/llm-image-specs.md for detailed token formulas, tile calculations, and optimal dimension tables for each model.
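
As a sanity check on the claude column, Anthropic's published per-image estimate is (width x height) / 750 tokens, which matches the sample output later in this document:

import math

def claude_image_tokens(width: int, height: int) -> int:
    """Approximate Claude vision tokens per image: (w * h) / 750."""
    return math.ceil(width * height / 750)

print(claude_image_tokens(1024, 576))  # 787, as in the sample output below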

OCR and Grayscale Mode

For videos containing text (screencasts, presentations, documents, terminal recordings):

# Full OCR pipeline via preset
python3 scripts/extract_frames.py screencast.mp4 --preset ocr --max-frames 40

# Manual OCR flags (can combine with any preset)
python3 scripts/extract_frames.py video.mp4 --grayscale --high-contrast

  • --grayscale: Converts frames to grayscale. Reduces file size ~60% with no OCR accuracy loss.
  • --high-contrast: Applies contrast=1.3, brightness=0.05 to improve text/background separation.
  • The ocr preset enables both flags, plus unsharp-mask sharpening, at 1568px and quality 1 (best JPEG).
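
For reference, these enhancements correspond to standard ffmpeg filters; a hedged reconstruction of the ocr preset's processing chain (the script's real filter graph is not shown here and may differ):

# Illustrative only: plausible ffmpeg filters behind the OCR flags.
OCR_FILTERS = ",".join([
    "format=gray",                      # --grayscale
    "eq=contrast=1.3:brightness=0.05",  # --high-contrast (values from this doc)
    "unsharp",                          # unsharp-mask sharpening (ocr preset)
])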

Advanced Options

FPS Selection Guide

When using --fps directly instead of --max-frames:

Video length   Recommended FPS   Rationale
------------   ---------------   ---------------------------------------
< 30s          2-5               Short clip, capture detail
30s - 5min     1                 Good balance of coverage vs frame count
5min - 30min   0.5               Avoid excessive frames
> 30min        0.1 - 0.2         Sample key moments only

Keep total frame count under ~50 for optimal LLM context usage. Formula: duration_seconds * fps = frame_count (for example, a 120 s video at 0.25 fps yields 30 frames).

Prefer --max-frames over manual FPS -- it auto-calculates the right rate and clamps to 0.05-30.0 FPS.

Timestamp Overlay

python3 scripts/extract_frames.py video.mp4 --timestamps --max-frames 30

Overlays the source filename and hh:mm:ss timestamp in the bottom-right corner of each frame (white text on semi-transparent black box). Use when the user needs to reference specific moments in the video.
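
This maps onto ffmpeg's drawtext filter; a hedged sketch, with styling and offsets guessed from the description above (the real script may differ):

# White text on a semi-transparent black box in the bottom-right corner;
# %{pts\:hms} expands to the frame's hh:mm:ss timestamp.
vf = (
    "drawtext=text='video.mp4 %{pts\\:hms}':"
    "fontcolor=white:box=1:boxcolor=black@0.5:"
    "x=w-tw-10:y=h-th-10"
)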

All CLI Options Reference

Option                 Type     Default      Description
--------------------   ------   ----------   --------------------------------------------------------------
video_path             pos.     (required)   Path to the video file
--fps                  float    1.0          Frames per second (mutually exclusive with --scene-threshold)
--scene-threshold      float    --           Scene-change sensitivity 0.0-1.0 (mutually exclusive with --fps)
--min-scene-interval   float    1.0          Min seconds between scene-change frames
--max-frames           int      --           Auto-calculate FPS to produce ~N frames
--preset               choice   balanced     efficient / balanced / detailed / ocr
--max-dimension        int      --           Override max pixel dimension (longest edge)
--quality              int      --           JPEG quality 1-31 (1=best, 31=worst)
--target-model         choice   --           claude / openai / gemini / universal
--grayscale            flag     off          Convert to grayscale
--high-contrast        flag     off          Boost contrast for text readability
--timestamps           flag     off          Overlay filename + timestamp on frames
--output-dir           string   temp dir     Output directory for extracted frames

Output JSON Structure

The script prints JSON to stdout with the following structure:

{
  "output_dir": "/tmp/video_frames_abc123/",
  "frames": ["/tmp/video_frames_abc123/frame_00001.jpg", "..."],
  "preset": "balanced",
  "resolution": { "width": 1024, "height": 576 },
  "token_estimate": {
    "frame_count": 30,
    "per_frame": {
      "claude": 787,
      "openai_high": 765,
      "openai_low": 85,
      "openai_patch": 934,
      "gemini": 258
    },
    "total": {
      "claude": 23610,
      "openai_high": 22950,
      "openai_low": 2550,
      "openai_patch": 28020,
      "gemini": 7740
    }
  },
  "summary": {
    "video_duration_seconds": 120.5,
    "extraction_method": "max_frames",
    "scene_changes_detected": null,
    "frames_extracted": 30,
    "estimated_total_tokens": {
      "claude": 23610,
      "openai_high": 22950,
      "openai_low": 2550,
      "openai_patch": 28020,
      "gemini": 7740
    }
  }
}

Use token_estimate.total to verify the frame set fits within model context limits before attaching frames to a prompt.
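
For example, reusing the parsed data dict from the Workflow sketch (the budget value here is an arbitrary placeholder, not a documented limit):

TOKEN_BUDGET = 100_000  # placeholder; pick one for your target model's context window

total = data["token_estimate"]["total"]["claude"]
if total > TOKEN_BUDGET:
    print(f"{total} tokens over budget; use fewer frames or --preset efficient")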

Note: openai_high and openai_low are for legacy models (GPT-4o, GPT-4.1). openai_patch is for newer models (gpt-5.4+, gpt-5-mini, o4-mini). See references/llm-image-specs.md for details.

On error, JSON with an "error" key is printed to stderr and the script exits with code 1.

After Extraction

  1. Parse the JSON output to get the list of frame paths from frames
  2. Check token_estimate.total to ensure the frames fit within context limits
  3. Read each frame image using the Read tool (they are JPEG files)
  4. Analyze the frames to answer the user's question
  5. Clean up: delete the output directory when done if it was a temp dir
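
A defensive sketch for step 5, deleting the directory only when it actually lives under the system temp dir (reuses the data dict from the Workflow sketch):

import os
import shutil
import tempfile

out = data["output_dir"]
if os.path.realpath(out).startswith(os.path.realpath(tempfile.gettempdir())):
    shutil.rmtree(out, ignore_errors=True)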