# learn-video
Pull a video from any yt-dlp-supported site (1800+, including YouTube, TikTok, Vimeo), align its audio with what's on screen, and produce a `fused.md` timeline ready for `/learn-eval` to save as a SKILL.md, rule, tip, or note.
Unlike transcript-only skills, this one uses an LLM to pick which timestamps need a visual reference, extracts just those frames with ffmpeg, and asks a vision model to fuse each frame with its transcript window. The output reads as `AUDIO` / `VISUAL` / `FUSED` blocks aligned to `[mm:ss]` markers.
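The exact layout is owned by `fuse.py`; a purely hypothetical excerpt, just to show the block shape implied above:

```text
recommended-form: skill
video-kind: visual

[03:12]
AUDIO: "...so pin the dependency before you run the migration..."
VISUAL: terminal showing the pinned install completing
FUSED: Pin the dependency first; the speaker demonstrates the pinned install succeeding before migrating.
```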
## When to use
- User shares a tutorial, talk, or short-form dev clip and wants the knowledge, not the video.
- User wants a consistent pipeline from many videos (e.g. an entire creator's channel) into the ECC knowledge system.
- Output needs to flow into `/learn-eval` — the `recommended-form:` header in `fused.md` is designed for that handoff.
## When not to use
- User wants the video file itself, subtitles, or clipped highlights — use `yt-dlp` or a clipping tool directly.
- Only the raw transcript is needed — a lightweight transcript skill is faster (no vision calls, no rate limiting).
- Live streams or content gated behind authentication.
## Quick start
```bash
# one-time setup (from this repo)
./setup.sh

# run it
python -m learn_video.cli run "https://www.youtube.com/watch?v=<id>"

# flags
python -m learn_video.cli run "<url>" \
  --tier=lite|pro|max       # model quality preset
  --model=<provider:model>  # override vision model (highest precedence)
  --offline                 # route to local Ollama
  --fresh                   # bypass cache
  --short                   # force whole-video File API upload (fails if >60s)
  --notes-only              # skip /learn-eval handoff
```
Stdout is the path to `fused.md`. Stderr carries `[N/6 STAGE]` progress lines. Claude should read `fused.md` after the CLI returns and hand off to `/learn-eval` unless `--notes-only` is set.
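Outside the slash command, that contract makes a scripted driver trivial; a minimal sketch relying only on the stdout/stderr behavior above:

```python
import pathlib
import subprocess

# Run the CLI; stdout is the fused.md path, progress lines go to stderr.
proc = subprocess.run(
    ["python", "-m", "learn_video.cli", "run", "https://www.youtube.com/watch?v=<id>"],
    capture_output=True, text=True, check=True,
)
fused = pathlib.Path(proc.stdout.strip()).read_text()
print(fused[:500])  # first few timeline blocks
```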
## Invocation from the slash command
The slash command `~/.claude/commands/learn-video.md` drives Claude to:
- Parse the URL and flags.
- Shell out to `python -m learn_video.cli run "$URL" [flags]` from the repo root (or wherever `learn_video` is importable).
- Read the resulting `fused.md`.
- Summarize the timeline for the user (title, `video-kind`, `recommended-form`, block count, 1–2 highlight `FUSED:` lines).
- Unless `--notes-only`, invoke `/learn-eval` with the `fused.md` contents.
## Pipeline stages (in `learn_video/`)
| Stage | Module | Purpose |
|---|---|---|
| 1. Ingest | `ingest.py` | `yt-dlp` metadata probe + download; picks 720p/480p/360p by duration |
| 2. Transcribe | `transcribe.py` | Platform auto-captions first, faster-whisper `small.en` fallback |
| Probe | `probe.py` | Classifies the video as `visual` or `audio-first`, routing to stage 3a or 3b |
| 3a. Target | `target.py` | LLM picks 3–15 timestamps worth a frame (sliding 15-min windows for >25-min videos) |
| 3b. Summary | `summary.py` | Audio-first path: chapter-style transcript summary, no frames |
| 4. Keyframes | `keyframes.py` | `ffmpeg` extracts one JPEG per target timestamp (sketched below) |
| 5. Vision | `vision.py` | Per-frame fusion, 6-way concurrent; plus a whole-video File API path for <60s clips |
| Classify | `classify.py` | Suggests `recommended-form:` as one of `skill`, `rule`, `tip`, or `note` |
| 6. Fuse | `fuse.py` | Composes `fused.md` with headers + timeline |
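Stage 4 reduces to one single-frame ffmpeg seek per timestamp; a sketch under that assumption (`extract_frame` is illustrative, not `keyframes.py`'s actual API):

```python
import subprocess

def extract_frame(video_path: str, ts_seconds: float, out_jpg: str) -> None:
    # Seek before -i for a fast jump, then grab a single high-quality JPEG.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", f"{ts_seconds:.2f}", "-i", video_path,
         "-frames:v", "1", "-q:v", "2", out_jpg],
        check=True, capture_output=True,
    )

extract_frame("video.mp4", 192.0, "frames/0192.jpg")
```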
## Model portability
Only `model_client.py` imports `langchain-*`. Every stage calls:
- `build_chat_model(model_id, **kwargs)` — takes `<provider>:<model>` id strings (e.g. `google_genai:gemini-flash-lite-latest`, `anthropic:claude-opus-4-7`, `ollama:qwen2.5vl:3b`).
- `invoke_structured(model, schema, messages)` — Pydantic schema validation with a `json_repair` fallback and an empty-response short-circuit.
- `invoke_vision(model, vision_input)` — portable `image_url` + data-URL shape that works on Gemini, Claude, and Ollama; `video_path` is Gemini-only (File API upload).
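A usage sketch under the signatures listed above; the `Targets` schema, prompt, and message shape are illustrative, not the repo's:

```python
from pydantic import BaseModel

from learn_video.model_client import build_chat_model, invoke_structured

class Targets(BaseModel):
    """Illustrative schema; the repo's actual targeting schema may differ."""
    timestamps: list[float]

model = build_chat_model("google_genai:gemini-flash-lite-latest")
targets = invoke_structured(
    model,
    Targets,
    # message format assumed; adjust to whatever model_client expects
    [{"role": "user", "content": "Pick 3-15 timestamps worth a frame:\n<transcript>"}],
)
print(targets.timestamps)
```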
A sliding-window rate limiter (13 requests per 60 s on Flash Lite) gates every provider call. To swap models, edit `TIER_MODELS` in `config.py`; stage code does not change.
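The limiter itself is the generic sliding-window pattern; a minimal sketch, assuming a thread-safe gate around each provider call (not the repo's actual class):

```python
import collections
import threading
import time

class SlidingWindowLimiter:
    """Generic sliding-window gate, e.g. 13 calls per 60 s."""

    def __init__(self, max_calls: int = 13, window_s: float = 60.0) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self._stamps: collections.deque[float] = collections.deque()
        self._lock = threading.Lock()

    def acquire(self) -> None:
        with self._lock:
            now = time.monotonic()
            # Drop timestamps that have aged out of the window.
            while self._stamps and now - self._stamps[0] >= self.window_s:
                self._stamps.popleft()
            if len(self._stamps) >= self.max_calls:
                # Sleep until the oldest call leaves the window, then retire it.
                time.sleep(self.window_s - (now - self._stamps[0]))
                self._stamps.popleft()
            self._stamps.append(time.monotonic())
```

Each worker would call `acquire()` immediately before its provider request; with 6-way concurrent vision calls, this only serializes when the window is full.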
## Tier → model mapping
| `--tier` | Targeting | Vision | Quota notes |
|---|---|---|---|
| `lite` (default) | `gemini-flash-lite-latest` | `gemini-flash-lite-latest` | ~100 videos/day on the free 500 RPD |
| `pro` | `gemini-flash-lite-latest` | `gemini-flash-latest` | 20 RPD cap on Flash |
| `max` | `gemini-flash-lite-latest` | `claude-opus-4-7` | Uses Claude quota |
| `--offline` | `qwen2.5vl:3b` | `qwen2.5vl:3b` | Local Ollama, slow |
Targeting always stays on Flash Lite — it's a cheap text task, and keeping the premium budget for vision is almost always correct.
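Read back into config, the table suggests `TIER_MODELS` is a dict of roughly this shape (the key names are assumptions; the model ids come from the table above):

```python
# Shape sketch of TIER_MODELS in config.py, not the repo's actual literal.
TIER_MODELS = {
    "lite": {
        "targeting": "google_genai:gemini-flash-lite-latest",
        "vision": "google_genai:gemini-flash-lite-latest",
    },
    "pro": {
        "targeting": "google_genai:gemini-flash-lite-latest",
        "vision": "google_genai:gemini-flash-latest",
    },
    "max": {
        "targeting": "google_genai:gemini-flash-lite-latest",
        "vision": "anthropic:claude-opus-4-7",
    },
    "offline": {
        "targeting": "ollama:qwen2.5vl:3b",
        "vision": "ollama:qwen2.5vl:3b",
    },
}
```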
## Cache
`~/.claude/cache/learn-video/<video-id>/` contains `meta.json`, `video.*`, `captions.vtt`, `transcript.json`, `targets.json`, `frames/*.jpg`, and `fused.md`. Every stage checks for its artifact before running — re-runs on the same URL skip straight to whatever changed.
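That skip logic is the usual artifact-memoization pattern; a sketch with a hypothetical helper (not the repo's API):

```python
from pathlib import Path
from typing import Callable

def ensure_artifact(path: Path, produce: Callable[[Path], None]) -> Path:
    """Hypothetical helper: run the stage only if its artifact is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        produce(path)
    return path

# e.g. a stage that writes transcript.json into the video's cache dir
cache_dir = Path.home() / ".claude" / "cache" / "learn-video" / "<video-id>"
ensure_artifact(cache_dir / "transcript.json",
                lambda p: p.write_text("{}"))  # stand-in for the real stage
```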
```bash
python -m learn_video.cli cache-info              # list cached videos
python -m learn_video.cli cache-clean <video-id>  # remove one
python -m learn_video.cli cache-clean all         # nuke everything
```
## Error taxonomy
| Class | Behavior | Example |
|---|---|---|
| `TransientError` | Auto-retries (tenacity: 5s, 10s, 20s with jitter; max 3; sketched below) | 429, `httpx.ReadError`, timeout |
| `ConfigurationError` | Fail fast with a fix hint | missing `GEMINI_API_KEY` |
| `EnvironmentError_` | Fail fast with an install command | `ffmpeg` not found |
| `TargetError` | Unrecoverable | DRM-blocked, deleted video |
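The `TransientError` schedule maps onto stock tenacity; a sketch of the decorator (exact jitter amount and placement are assumptions):

```python
import tenacity

class TransientError(Exception):
    """Stand-in for the project's transient error class."""

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(TransientError),
    # 5s, 10s, 20s backoff plus up to 2s of jitter (jitter amount assumed)
    wait=tenacity.wait_chain(
        tenacity.wait_fixed(5), tenacity.wait_fixed(10), tenacity.wait_fixed(20)
    ) + tenacity.wait_random(0, 2),
    stop=tenacity.stop_after_attempt(4),  # i.e. up to 3 retries
    reraise=True,
)
def call_provider(payload: dict) -> dict:
    ...  # the actual provider call would go here
    return {}
```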
## Requirements
- Python 3.13 (tested; 3.11+ should work — `tomllib` fallback to `tomli` already in place, sketched below)
- `ffmpeg` on PATH
- `yt-dlp` on PATH (installed via `pip install -r requirements.txt`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`) for the default `lite`/`pro` tiers
- `ANTHROPIC_API_KEY` if using `--tier=max` or `--model anthropic:*`
- Local Ollama with `qwen2.5vl:3b` pulled if using `--offline`
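The `tomllib`/`tomli` fallback mentioned above is presumably the standard shim:

```python
try:
    import tomllib  # stdlib on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # drop-in backport on older interpreters
```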
See `README.md` for a full architecture diagram and the 12-video demo output.