# learn-video
Pull a video from any yt-dlp-supported site (1800+, including YouTube, TikTok, Vimeo), align its audio with what's on screen, and produce a `fused.md` timeline ready for `/learn-eval` to save as a SKILL.md, rule, tip, or note.
Unlike transcript-only skills, this one uses an LLM to pick which timestamps need a visual reference, extracts just those frames with ffmpeg, and asks a vision model to fuse each frame with its transcript window. The output reads as `AUDIO` / `VISUAL` / `FUSED` blocks aligned to `[mm:ss]` markers.
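The exact layout is owned by `fuse.py`; a purely hypothetical excerpt, just to show the block shape implied above:

```text
recommended-form: skill
video-kind: visual

[03:12]
AUDIO: "...so pin the dependency before you run the migration..."
VISUAL: terminal showing the pinned install completing
FUSED: Pin the dependency first; the speaker demonstrates the pinned install succeeding before migrating.
```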
## When to use
- User shares a tutorial, talk, or short-form dev clip and wants the knowledge, not the video.
- User wants a consistent pipeline from many videos (e.g. an entire creator's channel) into the ECC knowledge system.
- Output needs to flow into `/learn-eval` — the `recommended-form:` header in `fused.md` is designed for that handoff.
## When not to use
- User wants the video file itself, subtitles, or clipped highlights — use `yt-dlp` or a clipping tool directly.
- Only the raw transcript is needed — a lightweight transcript skill is faster (no vision calls, no rate limiting).
- Live streams or content gated behind authentication.
## Quick start
```bash
# one-time setup (from this repo)
./setup.sh

# run it
python -m learn_video.cli run "https://www.youtube.com/watch?v=<id>"

# flags
python -m learn_video.cli run "<url>" \
  --tier=lite|pro|max       # model quality preset
  --model=<provider:model>  # override vision model (highest precedence)
  --offline                 # route to local Ollama
  --fresh                   # bypass cache
  --short                   # force whole-video File API upload (fails if >60s)
  --notes-only              # skip /learn-eval handoff
```
Stdout is the path to `fused.md`. Stderr carries `[N/6 STAGE]` progress lines. Claude should read `fused.md` after the CLI returns and hand off to `/learn-eval` unless `--notes-only` is set.
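Outside the slash command, that contract makes a scripted driver trivial; a minimal sketch relying only on the stdout/stderr behavior above:

```python
import pathlib
import subprocess

# Run the CLI; stdout is the fused.md path, progress lines go to stderr.
proc = subprocess.run(
    ["python", "-m", "learn_video.cli", "run", "https://www.youtube.com/watch?v=<id>"],
    capture_output=True, text=True, check=True,
)
fused = pathlib.Path(proc.stdout.strip()).read_text()
print(fused[:500])  # first few timeline blocks
```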
## Invocation from the slash command
The slash command `~/.claude/commands/learn-video.md` drives Claude to:
- Parse the URL and flags.
- Shell out to `python -m learn_video.cli run "$URL" [flags]` from the repo root (or wherever `learn_video` is importable).
- Read the resulting `fused.md`.
- Summarize the timeline for the user (title, `video-kind`, `recommended-form`, block count, 1–2 highlight `FUSED:` lines).
- Unless `--notes-only`, invoke `/learn-eval` with the `fused.md` contents.
## Pipeline stages (in `learn_video/`)
| Stage | Module | Purpose |
|---|---|---|
| 1. Ingest | `ingest.py` | `yt-dlp` metadata probe + download; picks 720p/480p/360p by duration |
| 2. Transcribe | `transcribe.py` | Platform auto-captions first, faster-whisper `small.en` fallback |
| Probe | `probe.py` | Classifies the video as `visual` or `audio-first`, routing to stage 3a or 3b |
| 3a. Target | `target.py` | LLM picks 3–15 timestamps worth a frame (sliding 15-min windows for >25-min videos) |
| 3b. Summary | `summary.py` | Audio-first path: chapter-style transcript summary, no frames |
| 4. Keyframes | `keyframes.py` | `ffmpeg` extracts one JPEG per target timestamp (sketched below) |
| 5. Vision | `vision.py` | Per-frame fusion, 6-way concurrent; plus a whole-video File API path for <60s clips |
| Classify | `classify.py` | Suggests `recommended-form:` as one of `skill`, `rule`, `tip`, or `note` |
| 6. Fuse | `fuse.py` | Composes `fused.md` with headers + timeline |
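Stage 4 reduces to one single-frame ffmpeg seek per timestamp; a sketch under that assumption (`extract_frame` is illustrative, not `keyframes.py`'s actual API):

```python
import subprocess

def extract_frame(video_path: str, ts_seconds: float, out_jpg: str) -> None:
    # Seek before -i for a fast jump, then grab a single high-quality JPEG.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", f"{ts_seconds:.2f}", "-i", video_path,
         "-frames:v", "1", "-q:v", "2", out_jpg],
        check=True, capture_output=True,
    )

extract_frame("video.mp4", 192.0, "frames/0192.jpg")
```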
## Model portability
Only `model_client.py` imports `langchain-*`. Every stage calls:
- `build_chat_model(model_id, **kwargs)` — takes `<provider>:<model>` id strings (e.g. `google_genai:gemini-flash-lite-latest`, `anthropic:claude-opus-4-7`, `ollama:qwen2.5vl:3b`).
- `invoke_structured(model, schema, messages)` — Pydantic schema validation with a `json_repair` fallback and an empty-response short-circuit.
- `invoke_vision(model, vision_input)` — portable `image_url` + data-URL shape that works on Gemini, Claude, and Ollama; `video_path` is Gemini-only (File API upload).
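A usage sketch under the signatures listed above; the `Targets` schema, prompt, and message shape are illustrative, not the repo's:

```python
from pydantic import BaseModel

from learn_video.model_client import build_chat_model, invoke_structured

class Targets(BaseModel):
    """Illustrative schema; the repo's actual targeting schema may differ."""
    timestamps: list[float]

model = build_chat_model("google_genai:gemini-flash-lite-latest")
targets = invoke_structured(
    model,
    Targets,
    # message format assumed; adjust to whatever model_client expects
    [{"role": "user", "content": "Pick 3-15 timestamps worth a frame:\n<transcript>"}],
)
print(targets.timestamps)
```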
A sliding-window rate limiter (13 requests per 60 s on Flash Lite) gates every provider call. To swap models, edit `TIER_MODELS` in `config.py`; stage code does not change.
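The limiter itself is the generic sliding-window pattern; a minimal sketch, assuming a thread-safe gate around each provider call (not the repo's actual class):

```python
import collections
import threading
import time

class SlidingWindowLimiter:
    """Generic sliding-window gate, e.g. 13 calls per 60 s."""

    def __init__(self, max_calls: int = 13, window_s: float = 60.0) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self._stamps: collections.deque[float] = collections.deque()
        self._lock = threading.Lock()

    def acquire(self) -> None:
        with self._lock:
            now = time.monotonic()
            # Drop timestamps that have aged out of the window.
            while self._stamps and now - self._stamps[0] >= self.window_s:
                self._stamps.popleft()
            if len(self._stamps) >= self.max_calls:
                # Sleep until the oldest call leaves the window, then retire it.
                time.sleep(self.window_s - (now - self._stamps[0]))
                self._stamps.popleft()
            self._stamps.append(time.monotonic())
```

Each worker would call `acquire()` immediately before its provider request; with 6-way concurrent vision calls, this only serializes when the window is full.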
## Tier → model mapping
| `--tier` | Targeting | Vision | Quota notes |
|---|---|---|---|
| `lite` (default) | `gemini-flash-lite-latest` | `gemini-flash-lite-latest` | ~100 videos/day on the free 500 RPD |
| `pro` | `gemini-flash-lite-latest` | `gemini-flash-latest` | 20 RPD cap on Flash |
| `max` | `gemini-flash-lite-latest` | `claude-opus-4-7` | Uses Claude quota |
| `--offline` | `qwen2.5vl:3b` | `qwen2.5vl:3b` | Local Ollama, slow |
Targeting always stays on Flash Lite — it's a cheap text task, and keeping the premium budget for vision is almost always correct.
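Read back into config, the table suggests `TIER_MODELS` is a dict of roughly this shape (the key names are assumptions; the model ids come from the table above):

```python
# Shape sketch of TIER_MODELS in config.py, not the repo's actual literal.
TIER_MODELS = {
    "lite": {
        "targeting": "google_genai:gemini-flash-lite-latest",
        "vision": "google_genai:gemini-flash-lite-latest",
    },
    "pro": {
        "targeting": "google_genai:gemini-flash-lite-latest",
        "vision": "google_genai:gemini-flash-latest",
    },
    "max": {
        "targeting": "google_genai:gemini-flash-lite-latest",
        "vision": "anthropic:claude-opus-4-7",
    },
    "offline": {
        "targeting": "ollama:qwen2.5vl:3b",
        "vision": "ollama:qwen2.5vl:3b",
    },
}
```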
## Cache
`~/.claude/cache/learn-video/<video-id>/` contains `meta.json`, `video.*`, `captions.vtt`, `transcript.json`, `targets.json`, `frames/*.jpg`, and `fused.md`. Every stage checks for its artifact before running — re-runs on the same URL skip straight to whatever changed.
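That skip logic is the usual artifact-memoization pattern; a sketch with a hypothetical helper (not the repo's API):

```python
from pathlib import Path
from typing import Callable

def ensure_artifact(path: Path, produce: Callable[[Path], None]) -> Path:
    """Hypothetical helper: run the stage only if its artifact is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        produce(path)
    return path

# e.g. a stage that writes transcript.json into the video's cache dir
cache_dir = Path.home() / ".claude" / "cache" / "learn-video" / "<video-id>"
ensure_artifact(cache_dir / "transcript.json",
                lambda p: p.write_text("{}"))  # stand-in for the real stage
```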
```bash
python -m learn_video.cli cache-info              # list cached videos
python -m learn_video.cli cache-clean <video-id>  # remove one
python -m learn_video.cli cache-clean all         # nuke everything
```
## Error taxonomy
| Class | Behavior | Example |
|---|---|---|
| `TransientError` | Auto-retries (tenacity: 5s, 10s, 20s with jitter; max 3; sketched below) | 429, `httpx.ReadError`, timeout |
| `ConfigurationError` | Fail fast with a fix hint | missing `GEMINI_API_KEY` |
| `EnvironmentError_` | Fail fast with an install command | `ffmpeg` not found |
| `TargetError` | Unrecoverable | DRM-blocked, deleted video |
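The `TransientError` schedule maps onto stock tenacity; a sketch of the decorator (exact jitter amount and placement are assumptions):

```python
import tenacity

class TransientError(Exception):
    """Stand-in for the project's transient error class."""

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(TransientError),
    # 5s, 10s, 20s backoff plus up to 2s of jitter (jitter amount assumed)
    wait=tenacity.wait_chain(
        tenacity.wait_fixed(5), tenacity.wait_fixed(10), tenacity.wait_fixed(20)
    ) + tenacity.wait_random(0, 2),
    stop=tenacity.stop_after_attempt(4),  # i.e. up to 3 retries
    reraise=True,
)
def call_provider(payload: dict) -> dict:
    ...  # the actual provider call would go here
    return {}
```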
## Requirements
- Python 3.13 (tested; 3.11+ should work — `tomllib` fallback to `tomli` already in place, sketched below)
- `ffmpeg` on PATH
- `yt-dlp` on PATH (installed via `pip install -r requirements.txt`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`) for the default `lite`/`pro` tiers
- `ANTHROPIC_API_KEY` if using `--tier=max` or `--model anthropic:*`
- Local Ollama with `qwen2.5vl:3b` pulled if using `--offline`
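The `tomllib`/`tomli` fallback mentioned above is presumably the standard shim:

```python
try:
    import tomllib  # stdlib on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # drop-in backport on older interpreters
```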
See `README.md` for a full architecture diagram and the 12-video demo output.