video-perception
Video Perception
You have access to video understanding tools via the claude-video-vision MCP server.
Available Tools
video_analyze— Analyze video structure with ffmpeg filters (scene changes, silence, motion, etc.). Use this BEFORE extracting frames to plan your strategy.video_watch— Extract frames + process audio from a video. Supports variable FPS/resolution per segment.video_detail— Drill into specific segments. Separates extraction from viewing — extract many frames, view few at a time.video_info— Get video metadata without processing.video_configure— Change settings (backend, resolution, enable_index, etc.).video_setup— Check/install dependencies.
Workflow
IMPORTANT: You MUST follow these steps in order. Do NOT skip step 2.
-
Always start with
video_infoto get duration, resolution, and audio presence. -
REQUIRED for videos > 30s: Call
video_analyzeBEFORE extracting any frames. This is NOT optional — it gives you structural data to make smart extraction decisions. Select filters relevant to the user's question:User intent Filters to select "What happens in this video?" scene_changes, silence, transcription "Find the scene transitions" scene_changes, black_intervals "Are there frozen/stuck parts?" freeze, blur "Is this a talking head or action?" motion "When does the music start?" silence, loudness "Analyze the lighting" exposure "Summarize this lecture" transcription, scene_changes, silence General / unclear intent scene_changes, silence, transcription Always include
transcription: truewhen the video has audio — the transcription tells you WHERE to look visually. -
Use the analysis results and transcription to plan your frame extraction strategy:
- Low FPS (0.1-0.5) for static or predictable segments
- Higher FPS (1-3) only around scene changes, motion peaks, or moments referenced in speech ("look at this", "as you can see", "let me show you")
- Never exceed the minimum FPS needed for the task
- Prefer fewer segments at lower FPS — you can always drill deeper
-
Call
video_watchto extract frames:- For short videos (< 2 minutes): Use
fps: "auto"withoutview_sample— short videos need full coverage to avoid missing brief moments. The auto FPS already adapts to duration. - For long videos (> 2 minutes): Use
segmentsbased on analysis data with variable FPS, andview_sampleto limit initial frame count. You can always drill deeper withvideo_detail.
- For short videos (< 2 minutes): Use
-
Use
video_detailto drill into specific moments:- Start with 3-5 second windows around points of interest
- Use
view_sample: 3to preview (first, middle, last frame) - Then request specific timestamps with
viewif you need more detail - Expand the window only if the initial view is insufficient
- Treat frame viewing like a binary search — narrow down to what matters
- Never view all extracted frames at once
-
When the user asks follow-up questions about the same video, consult the manifest already in your context. Do not re-extract frames you already have at the same resolution. Do not re-request frames you already have in context.
Parameter Guide
fps: "auto" for general overview. Use the video's original fps (from video_info) for frame-by-frame detail. Use 5-10 for analyzing specific short moments. Use 0.1-0.5 for long videos.
resolution: 256-512 for quick scans. 512-768 for normal analysis. 1024+ when reading on-screen text or fine details.
segments: Use when you have analysis data. Each segment can have its own fps and resolution. Overrides global fps/start_time/end_time.
view_sample: Returns N evenly spaced frames from the extracted set. Use this to avoid flooding context with too many images.
skip_audio: Set to true when you only need visual analysis.
Working with Results
You receive:
- Manifest (when enable_index is on) — index of all cached frames by resolution and timestamp. Use this to avoid redundant requests.
- Frames as images — look at them to understand what's happening visually
- Audio transcription with timestamps — read the speech content
- Audio tags — non-speech events (music, sounds, etc.)
- Analysis data — scene changes, silence intervals, motion levels, etc.
Combine all sources to form a complete understanding. Use analysis + transcription to guide where you look visually. The analysis tells you WHEN things happen; the frames tell you WHAT happens.