video-frames
Video Frames
Extract frames from video files using ffmpeg, producing JPEG images optimized for LLM vision analysis. Supports multiple frame-selection strategies (fixed FPS, scene detection, target frame count), quality presets, model-aware dimension optimization, and OCR enhancements.
Prerequisites
ffmpeg and ffprobe must be installed and on PATH:
brew install ffmpeg # macOS
Workflow
- Receive a video file path from the user
- Run scripts/extract_frames.py to extract JPEG frames
- Parse the JSON output for frame paths, resolution, and token estimates
- Read the extracted frames as image attachments for analysis
- Answer the user's question about the video content
- Clean up temp directories when done
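The workflow above can be sketched as a small Python wrapper. This is a minimal sketch, not part of the skill itself: the helper names `build_command`, `parse_frames`, and `extract` are hypothetical, and it assumes the script is invoked from the skill's root directory with `python3` on PATH.

```python
import json
import subprocess


def build_command(video_path, max_frames=None, preset=None):
    """Assemble the argv list for scripts/extract_frames.py (hypothetical helper)."""
    cmd = ["python3", "scripts/extract_frames.py", video_path]
    if max_frames is not None:
        cmd += ["--max-frames", str(max_frames)]
    if preset is not None:
        cmd += ["--preset", preset]
    return cmd


def parse_frames(stdout_text):
    """Return (frame_paths, resolution) from the JSON the script prints to stdout."""
    data = json.loads(stdout_text)
    return data["frames"], data["resolution"]


def extract(video_path, **options):
    """Run the extractor and return the parsed frame paths and resolution."""
    proc = subprocess.run(
        build_command(video_path, **options),
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError on a non-zero exit code
    )
    return parse_frames(proc.stdout)
```

Calling `extract("video.mp4", max_frames=30)` would run the same command as the `--max-frames 30` example in Quick Start.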
Quick Start
The simplest invocation -- extracts 1 frame/second at balanced quality:
python3 scripts/extract_frames.py video.mp4
For most use cases, use --max-frames to let the script auto-calculate FPS:
python3 scripts/extract_frames.py video.mp4 --max-frames 30
This is the preferred approach over manually setting --fps, since it adapts to any video length and keeps the frame count predictable.
Presets
Four quality presets control resolution, JPEG quality, and image processing:
| Preset | Max dim | Quality | Extras | Best for |
|---|---|---|---|---|
| efficient | 768px | 5 | -- | Bulk frames, long videos |
| balanced | 1024px | 3 | -- | General analysis (default) |
| detailed | 1568px | 2 | -- | Fine detail, small objects |
| ocr | 1568px | 1 | grayscale + high contrast + sharpen | Text/document extraction |
# Long video, keep costs low
python3 scripts/extract_frames.py long_video.mp4 --max-frames 20 --preset efficient
# Need to read text in a screencast
python3 scripts/extract_frames.py screencast.mp4 --max-frames 40 --preset ocr
Quality (1=best, 31=worst) and max dimension can be overridden independently:
python3 scripts/extract_frames.py video.mp4 --preset balanced --quality 1 --max-dimension 1568
Scene-Change Detection
Instead of extracting at a fixed rate, detect visual scene changes and extract one frame per scene. This is ideal for videos with distinct segments (presentations, edited footage, tutorials).
python3 scripts/extract_frames.py video.mp4 --scene-threshold 0.3
- --scene-threshold (float, 0.0-1.0): Sensitivity. Lower = more sensitive, detects smaller changes. Start with 0.3 (the default when the flag is used).
- --min-scene-interval (float, default: 1.0): Minimum seconds between detected scenes. Prevents burst detections during rapid cuts.
Note: --fps and --scene-threshold are mutually exclusive. --max-frames can only be used with --fps mode, not scene detection.
# Presentation with clear slide transitions
python3 scripts/extract_frames.py presentation.mp4 --scene-threshold 0.2
# Action footage -- less sensitive, min 2s apart
python3 scripts/extract_frames.py action.mp4 --scene-threshold 0.5 --min-scene-interval 2.0
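To illustrate how the two scene-detection options interact, here is a minimal sketch. It assumes the script relies on ffmpeg's `select` filter with the `scene` score, and that the minimum interval is enforced by dropping detections that fall too close to the previously kept one. Both are assumptions about the implementation, and the function names are hypothetical.

```python
def scene_select_filter(threshold=0.3):
    """Build an ffmpeg select expression that keeps frames whose scene score exceeds threshold."""
    return f"select='gt(scene,{threshold})'"


def enforce_min_interval(timestamps, min_interval=1.0):
    """Keep only timestamps at least min_interval seconds after the last kept one."""
    kept = []
    for t in sorted(timestamps):
        if not kept or t - kept[-1] >= min_interval:
            kept.append(t)
    return kept
```

With detections at 0.0, 0.4, 1.2, 1.9, and 3.5 seconds, a 1.0 s interval keeps 0.0, 1.2, and 3.5, while a 2.0 s interval keeps only 0.0 and 3.5.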
Model-Aware Optimization
Use --target-model to resize frames to dimensions that align with a specific model's tile boundaries, minimizing wasted tokens:
| Model | Max dim | Rationale |
|---|---|---|
| claude | 1568px | Max native resolution before auto-resize |
| openai | 768px | Aligned to 512px tile grid (shortest side 768) |
| gemini | 768px | Aligned to 768px tile boundaries |
| universal | 768px | Sweet spot across all models (default) |
# Optimized for Claude -- maximum detail
python3 scripts/extract_frames.py video.mp4 --max-frames 30 --target-model claude
# Optimized for GPT-4o -- efficient tile packing
python3 scripts/extract_frames.py video.mp4 --max-frames 30 --target-model openai
--target-model sets the max dimension unless --max-dimension is explicitly provided (CLI override takes priority).
See references/llm-image-specs.md for detailed token formulas, tile calculations, and optimal dimension tables for each model.
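The override order described above can be sketched as follows. The dimension values come from the preset and model tables in this document. The function name and the assumption that --target-model takes priority over the preset's own max dimension are illustrative, not confirmed by the script source.

```python
# Max dimensions copied from the preset and model tables in this document.
PRESET_MAX_DIM = {"efficient": 768, "balanced": 1024, "detailed": 1568, "ocr": 1568}
MODEL_MAX_DIM = {"claude": 1568, "openai": 768, "gemini": 768, "universal": 768}


def resolve_max_dimension(preset="balanced", target_model=None, max_dimension=None):
    """Pick the effective max dimension: explicit flag, then target model, then preset."""
    if max_dimension is not None:
        return max_dimension  # --max-dimension always wins
    if target_model is not None:
        return MODEL_MAX_DIM[target_model]  # assumed to override the preset
    return PRESET_MAX_DIM[preset]
```

For example, `resolve_max_dimension(preset="efficient", target_model="claude")` yields 1568 under these assumptions, and adding `max_dimension=1024` yields 1024.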
OCR and Grayscale Mode
For videos containing text (screencasts, presentations, documents, terminal recordings):
# Full OCR pipeline via preset
python3 scripts/extract_frames.py screencast.mp4 --preset ocr --max-frames 40
# Manual OCR flags (can combine with any preset)
python3 scripts/extract_frames.py video.mp4 --grayscale --high-contrast
- --grayscale: Converts frames to grayscale. Reduces file size ~60% with no OCR accuracy loss.
- --high-contrast: Applies contrast=1.3, brightness=0.05 to improve text/background separation.
- The ocr preset enables both flags plus unsharp-mask sharpening at 1568px, quality 1 (best JPEG).
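As a rough illustration of how these flags could translate into an ffmpeg filter chain, the sketch below composes `format`, `eq`, and `unsharp` filters. The contrast and brightness values come from the description above. The function name and the `unsharp` parameters are assumptions, since this document does not state the sharpening settings the script uses.

```python
def build_filter_chain(grayscale=False, high_contrast=False, sharpen=False):
    """Compose a comma-separated ffmpeg -vf filter string from the OCR-related flags."""
    filters = []
    if grayscale:
        filters.append("format=gray")
    if high_contrast:
        # Values taken from the --high-contrast description.
        filters.append("eq=contrast=1.3:brightness=0.05")
    if sharpen:
        # Assumed unsharp settings (5x5 luma matrix, amount 1.0); the script may differ.
        filters.append("unsharp=5:5:1.0")
    return ",".join(filters)
```

Enabling all three options approximates what the ocr preset is described as doing, before resizing and JPEG quality are applied.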
Advanced Options
FPS Selection Guide
When using --fps directly instead of --max-frames:
| Video length | Recommended fps | Rationale |
|---|---|---|
| < 30s | 2-5 | Short clip, capture detail |
| 30s - 5min | 1 | Good balance of coverage vs frame count |
| 5min - 30min | 0.5 | Avoid excessive frames |
| > 30min | 0.1 - 0.2 | Sample key moments only |
Keep total frame count under ~50 for optimal LLM context usage. Formula: duration_seconds * fps = frame_count.
Prefer --max-frames over manual FPS -- it auto-calculates the right rate and clamps to 0.05-30.0 FPS.
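The relationship between duration, FPS, and frame count can be sketched as below. The clamp range of 0.05-30.0 comes from the text above. The function names and the assumption that the rate is simply `max_frames / duration` before clamping are illustrative.

```python
def fps_for_max_frames(duration_seconds, max_frames, lo=0.05, hi=30.0):
    """Derive an FPS that yields roughly max_frames frames, clamped to [lo, hi]."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return min(hi, max(lo, max_frames / duration_seconds))


def estimated_frame_count(duration_seconds, fps):
    """Apply the formula duration_seconds * fps = frame_count."""
    return round(duration_seconds * fps)
```

For a 120-second video and `--max-frames 30`, this gives 0.25 FPS and 30 frames. Under this sketch, the lower clamp means a very long video can exceed the requested count: a 2-hour video at 0.05 FPS would still produce about 360 frames.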
Timestamp Overlay
python3 scripts/extract_frames.py video.mp4 --timestamps --max-frames 30
Overlays the source filename and hh:mm:ss timestamp in the bottom-right corner of each frame (white text on semi-transparent black box). Use when the user needs to reference specific moments in the video.
All CLI Options Reference
| Option | Type | Default | Description |
|---|---|---|---|
| video_path | pos. | (required) | Path to the video file |
| --fps | float | 1.0 | Frames per second (mutually exclusive with --scene-threshold) |
| --scene-threshold | float | -- | Scene-change sensitivity 0.0-1.0 (mutually exclusive with --fps) |
| --min-scene-interval | float | 1.0 | Min seconds between scene-change frames |
| --max-frames | int | -- | Auto-calculate FPS to produce ~N frames |
| --preset | choice | balanced | efficient / balanced / detailed / ocr |
| --max-dimension | int | -- | Override max pixel dimension (longest edge) |
| --quality | int | -- | JPEG quality 1-31 (1=best, 31=worst) |
| --target-model | choice | -- | claude / openai / gemini / universal |
| --grayscale | flag | off | Convert to grayscale |
| --high-contrast | flag | off | Boost contrast for text readability |
| --timestamps | flag | off | Overlay filename + timestamp on frames |
| --output-dir | string | temp dir | Output directory for extracted frames |
Output JSON Structure
The script prints JSON to stdout with the following structure:
{
"output_dir": "/tmp/video_frames_abc123/",
"frames": ["/tmp/video_frames_abc123/frame_00001.jpg", "..."],
"preset": "balanced",
"resolution": { "width": 1024, "height": 576 },
"token_estimate": {
"frame_count": 30,
"per_frame": {
"claude": 787,
"openai_high": 765,
"openai_low": 85,
"openai_patch": 934,
"gemini": 258
},
"total": {
"claude": 23610,
"openai_high": 22950,
"openai_low": 2550,
"openai_patch": 28020,
"gemini": 7740
}
},
"summary": {
"video_duration_seconds": 120.5,
"extraction_method": "max_frames",
"scene_changes_detected": null,
"frames_extracted": 30,
"estimated_total_tokens": {
"claude": 23610,
"openai_high": 22950,
"openai_low": 2550,
"openai_patch": 28020,
"gemini": 7740
}
}
}
Use token_estimate.total to verify the frame set fits within model context limits before attaching frames to a prompt.
Note: openai_high and openai_low are for legacy models (GPT-4o, GPT-4.1). openai_patch is for newer models (gpt-5.4+, gpt-5-mini, o4-mini). See references/llm-image-specs.md for details.
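For orientation, the claude and openai_high figures in the example output can be reproduced with the commonly published approximations: roughly width * height / 750 tokens for Claude, and 85 base tokens plus 170 per 512px tile for OpenAI high-detail mode. This is a sketch under those assumptions; the script's actual formulas are in references/llm-image-specs.md and may differ, and the function names are hypothetical.

```python
import math


def claude_tokens(width, height):
    """Approximate Claude image tokens as ceil(width * height / 750)."""
    return math.ceil(width * height / 750)


def openai_high_tokens(width, height):
    """Approximate OpenAI high-detail tokens: 85 base + 170 per 512px tile."""
    # Fit within 2048x2048, then shrink so the shortest side is at most 768 (no upscaling).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

At 1024x576 these give 787 and 765 tokens per frame, matching the per_frame values in the example JSON, and 23610 and 22950 for 30 frames.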
On error, JSON with an "error" key is printed to stderr and the script exits with code 1.
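A caller can handle this error contract along the following lines. This is a minimal sketch: `ExtractionError` and `interpret_result` are hypothetical names, and the fallback to raw stderr text is an added safety net for cases where stderr is not clean JSON.

```python
import json


class ExtractionError(RuntimeError):
    """Raised when the extraction script reports a failure (hypothetical helper type)."""


def interpret_result(returncode, stdout_text, stderr_text):
    """Return the parsed JSON result on success, or raise with the reported error message."""
    if returncode != 0:
        try:
            message = json.loads(stderr_text)["error"]
        except (json.JSONDecodeError, KeyError, TypeError):
            # stderr was not the expected JSON object; fall back to the raw text.
            message = stderr_text.strip() or f"exit code {returncode}"
        raise ExtractionError(message)
    return json.loads(stdout_text)
```

The three arguments map directly onto the `returncode`, `stdout`, and `stderr` attributes of a `subprocess.run(..., capture_output=True, text=True)` result.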
After Extraction
- Parse the JSON output to get the list of frame paths from frames
- Check token_estimate.total to ensure the frames fit within context limits
- Read each frame image using the Read tool (they are JPEG files)
- Analyze the frames to answer the user's question
- Clean up: delete the output directory when done if it was a temp dir
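The budget check and cleanup steps above can be sketched as two small helpers. The function names, the `token_budget` value, and the `user_supplied_output_dir` flag are hypothetical; the caller is assumed to know whether it passed --output-dir.

```python
import shutil


def fits_budget(result, model_key, token_budget):
    """Check token_estimate.total for one model key against a caller-chosen budget."""
    return result["token_estimate"]["total"][model_key] <= token_budget


def cleanup(result, user_supplied_output_dir=False):
    """Delete the output directory unless the caller chose it via --output-dir."""
    if user_supplied_output_dir:
        return False  # leave user-chosen directories in place
    shutil.rmtree(result["output_dir"], ignore_errors=True)
    return True
```

With the example JSON above, `fits_budget(result, "claude", 50000)` is true because the claude total is 23610. If the check fails, re-running with a lower --max-frames or the efficient preset reduces the total.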