Video Use

Principles

  1. LLM reasons from raw transcript + on-demand visuals. The only derived artifact that earns its keep is a packed phrase-level transcript (takes_packed.md). Everything else — filler tagging, retake detection, shot classification, emphasis scoring — you derive at decision time.
  2. Audio is primary, visuals follow. Cut candidates come from speech boundaries and silence gaps. Drill into visuals only at decision points.
  3. Ask → confirm → execute → iterate → persist. Never touch the cut until the user has confirmed the strategy in plain English.
  4. Generalize. Do not assume what kind of video this is. Look at the material, ask the user, then edit.
  5. Artistic freedom is the default. Every specific value, preset, font, color, duration, pitch structure, and technique in this document is a worked example from one proven video — not a mandate. Read them to understand what's possible and why each worked. Then make your own taste calls based on what the material actually is and what the user actually wants. The only things you MUST do are in the Hard Rules section below. Everything else is yours.
  6. Invent freely. If the material calls for a technique not described here — split-screen, picture-in-picture, lower-third identity cards, reaction cuts, freeze frames, crossfades, match cuts, L-cuts, J-cuts, speed ramps over breath, whatever — build it. The helpers are ffmpeg and PIL. They can do anything the format supports. Do not wait for permission.
  7. Verify your own output before showing it to the user. If you wouldn't ship it, don't present it.

Hard Rules (production correctness — non-negotiable)

These are the things where deviation produces silent failures or broken output. They are not taste, they are correctness. Memorize them.

  1. Subtitles are applied LAST in the filter chain, after every overlay. Otherwise overlays hide captions. Silent failure.
  2. Per-segment extract → lossless -c copy concat, not single-pass filtergraph. Otherwise you double-encode every segment when overlays are added.
  3. 30ms audio fades at every segment boundary (afade=t=in:st=0:d=0.03,afade=t=out:st={dur-0.03}:d=0.03). Otherwise audible pops at every cut.
  4. Overlays use setpts=PTS-STARTPTS+<window_start>/TB (window start in output-timeline seconds) to shift the overlay's frame 0 to its window start. Otherwise you see the middle of the animation during the overlay window.
  5. Master SRT uses output-timeline offsets: output_time = word.start - segment_start + segment_offset. Otherwise captions misalign after segment concat. (Sketch of Rules 3–5 after this list.)
  6. Never cut inside a word. Snap every cut edge to a word boundary from the Scribe transcript.
  7. Pad every cut edge. Working window: 30–200ms. Scribe timestamps drift 50–100ms — padding absorbs the drift. Tighter for fast-paced, looser for cinematic.
  8. Word-level verbatim ASR only. Never SRT/phrase mode (loses sub-second gap data). Never normalized fillers (loses editorial signal).
  9. Cache transcripts per source. Never re-transcribe unless the source file itself changed.
  10. Parallel sub-agents for multiple animations. Never sequential. Spawn N at once via the Agent tool; total wall time ≈ slowest one.
  11. Strategy confirmation before execution. Never touch the cut until the user has approved the plain-English plan.
  12. All session outputs in <videos_dir>/edit/. Never write inside the video-use/ project directory.

Everything else in this document is a worked example. Deviate whenever the material calls for it.
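
A minimal sketch of Rules 3–5 in Python. The helper names and the segment_offset argument are illustrative, not render.py internals:

def afade_filter(dur):
    # Rule 3: 30ms fade-in and fade-out at every segment boundary.
    return f"afade=t=in:st=0:d=0.03,afade=t=out:st={dur - 0.03:.3f}:d=0.03"

def overlay_setpts(window_start):
    # Rule 4: shift the overlay's frame 0 to its output-timeline window start.
    return f"setpts=PTS-STARTPTS+{window_start:.3f}/TB"

def srt_time(word_start, segment_start, segment_offset):
    # Rule 5: map a source-timeline word timestamp onto the output timeline.
    return word_start - segment_start + segment_offset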

Directory layout

The skill lives in video-use/. User footage lives wherever they put it. All session outputs go into <videos_dir>/edit/.

<videos_dir>/
├── <source files, untouched>
└── edit/
    ├── project.md               ← memory; appended every session
    ├── takes_packed.md          ← phrase-level transcripts, the LLM's primary reading view
    ├── edl.json                 ← cut decisions
    ├── transcripts/<name>.json  ← cached raw Scribe JSON
    ├── animations/slot_<id>/    ← per-animation source + render + reasoning
    ├── clips_graded/            ← per-segment extracts with grade + fades
    ├── master.srt               ← output-timeline subtitles
    ├── downloads/               ← yt-dlp outputs
    ├── verify/                  ← debug frames / timeline PNGs
    ├── preview.mp4
    └── final.mp4

Setup

First-time install lives in install.md (clone, deps, ffmpeg, skill registration, API key). Don't re-run it every session; on cold start just verify:

  • ELEVENLABS_API_KEY resolves — either in the environment or in .env at the video-use repo root. If missing, ask the user to paste one and write it to .env (never to the user's <videos_dir>).
  • ffmpeg + ffprobe on PATH.
  • Python deps installed (uv sync or pip install -e . inside the repo).
  • yt-dlp, manim, Remotion installed only on first use.
  • This skill vendors skills/manim-video/. Read its SKILL.md when building a Manim slot.

Helpers (helpers/transcribe.py, helpers/render.py, etc.) live alongside this SKILL.md. Resolve their paths relative to the directory containing this file — the skill is typically symlinked at ~/.claude/skills/video-use/ or ~/.codex/skills/video-use/.

Helpers

  • transcribe.py <video> — single-file Scribe call. --num-speakers N optional. Cached.
  • transcribe_batch.py <videos_dir> — 4-worker parallel transcription. Use for multi-take.
  • pack_transcripts.py --edit-dir <dir> — transcripts/*.json → takes_packed.md (phrase-level, break on silence ≥ 0.5s).
  • timeline_view.py <video> <start> <end> — filmstrip + waveform PNG. On-demand visual drill-down. Not a scan tool — use it at decision points, not constantly.
  • render.py <edl.json> -o <out> — per-segment extract → concat → overlays (PTS-shifted) → subtitles LAST. --preview for 720p fast. --build-subtitles to generate master.srt inline.
  • grade.py <in> -o <out> — ffmpeg filter chain grade. Presets + --filter '<raw>' for custom.

For animations, create <edit>/animations/slot_<id>/ with Bash and spawn a sub-agent via the Agent tool.

The process

  1. Inventory. ffprobe every source. transcribe_batch.py on the directory. pack_transcripts.py to produce takes_packed.md. Sample one or two timeline_views for a visual first impression.

  2. Pre-scan for problems. One pass over takes_packed.md to note verbal slips, obvious mis-speaks, or phrasings to avoid. Plain list, feed into the editor brief.

  3. Converse. Describe what you see in plain English. Ask questions shaped by the material. Collect: content type, target length/aspect, aesthetic/brand direction, pacing feel, must-preserve moments, must-cut moments, animation and grade preferences, subtitle needs. Do not use a fixed checklist — the right questions are different every time.

  4. Propose strategy. 4–8 sentences: shape, take choices, cut direction, animation plan, grade direction, subtitle style, length estimate. Wait for confirmation.

  5. Execute. Produce edl.json via the editor sub-agent brief. Drill into timeline_view at ambiguous moments. Build animations in parallel sub-agents. Apply grade per-segment. Compose via render.py.

  6. Preview. render.py --preview.

  7. Self-eval (before showing the user). Run timeline_view on the rendered output (not the sources) at every cut boundary (±1.5s window). Check each image for:

    • Visual discontinuity / flash / jump at the cut
    • Waveform spike at the boundary (audio pop that slipped past the 30ms fade)
    • Subtitle hidden behind an overlay (Rule 1 violation)
    • Overlay misaligned or showing wrong frames (Rule 4 violation)

    Also sample: first 2s, last 2s, and 2–3 mid-points — check grade consistency, subtitle readability, overall coherence. Run ffprobe on the output to verify duration matches the EDL expectation (sketch after this list).

    If anything fails: fix → re-render → re-eval. Cap at 3 self-eval passes — if issues remain after 3, flag them to the user rather than looping forever. Only present the preview once the self-eval passes.

  8. Iterate + persist. Natural-language feedback, re-plan, re-render. Never re-transcribe. Final render on confirmation. Append to project.md.
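
The duration check from step 7, as a sketch. It assumes an edl dict shaped like the EDL format below:

import json
import subprocess

def output_duration(path):
    # ffprobe prints the container duration in seconds: one value, no keys.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

edl = json.load(open("edit/edl.json"))
expected = sum(r["end"] - r["start"] for r in edl["ranges"])
actual = output_duration("edit/preview.mp4")
# Allow slack for frame snapping at cut boundaries.
assert abs(actual - expected) < 0.5, f"EDL {expected:.1f}s vs output {actual:.1f}s"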

Cut craft (techniques)

  • Audio-first. Candidate cuts from word boundaries and silence gaps.
  • Preserve peaks. Laughs, punchlines, emphasis beats. Extend past punchlines to include reactions — the laugh IS the beat.
  • Speaker handoffs benefit from air between utterances. Common values: 400–600ms. Less for fast-paced, more for cinematic. Taste call.
  • Audio events as signals. (laughs), (sighs), (applause) mark beats. Extend past them.
  • Silence gaps are cut candidates. Silences ≥400ms are usually the cleanest. 150–400ms phrase boundaries are usable with a visual check. <150ms is unsafe (mid-phrase). Sketch after this list.
  • Example cut padding (the launch video shipped with this): 50ms before the first kept word, 80ms after the last. Tighter for montage energy, looser for documentary. Stay in the 30–200ms working window (Hard Rule 7).
  • Never reason audio and video independently. Every cut must work on both tracks.
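
A sketch of the audio-first candidate scan. The start/end word fields are an assumption about the cached Scribe JSON, not a documented schema:

def cut_candidates(words, clean=0.4, usable=0.15):
    out = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= clean:
            out.append({"at": prev["end"], "gap": gap, "grade": "clean"})
        elif gap >= usable:
            # Usable, but verify with timeline_view before cutting here.
            out.append({"at": prev["end"], "gap": gap, "grade": "check"})
        # Gaps under 150ms are mid-phrase: never cut there.
    return out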

The packed transcript (primary reading view)

pack_transcripts.py reads all transcripts/*.json and produces one markdown file where each take is a list of phrase-level lines, each prefixed with its [start-end] time range. Phrases break on any silence ≥ 0.5s OR speaker change. This is the artifact the editor sub-agent reads to pick cuts — it gives word-boundary precision from text alone at 1/10 the tokens of raw JSON.

Example:

## C0103  (duration: 43.0s, 8 phrases)
  [002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
  [006.08-006.74] S0 We fixed this.
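
The packing logic reduced to a sketch (the real pack_transcripts.py may differ; the speaker field name is an assumption):

def pack(words, gap=0.5):
    # Break the word stream into phrases on silence >= 0.5s or speaker change.
    phrases, cur = [], [words[0]]
    for prev, nxt in zip(words, words[1:]):
        if nxt["start"] - prev["end"] >= gap or nxt["speaker"] != prev["speaker"]:
            phrases.append(cur)
            cur = [nxt]
        else:
            cur.append(nxt)
    phrases.append(cur)
    return [f"[{p[0]['start']:06.2f}-{p[-1]['end']:06.2f}] {p[0]['speaker']} "
            + " ".join(w["text"] for w in p) for p in phrases]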

Editor sub-agent brief (for multi-take selection)

When the task is "pick the best take of each beat across many clips," spawn a dedicated sub-agent with a brief shaped like this. The structure is load-bearing; the pitch-shape example is not.

You are editing a <type> video. Pick the best take of each beat and 
assemble them chronologically by beat, not by source clip order.

INPUTS:
  - takes_packed.md (time-annotated phrase-level transcripts of all takes)
  - Product/narrative context: <2 sentences from the user>
  - Speaker(s): <name, role, delivery style note>
  - Expected structure: <pick an archetype or invent one>
  - Verbal slips to avoid: <list from the pre-scan pass>
  - Target runtime: <seconds>

Common structural archetypes (pick, adapt, or invent):
  - Tech launch / demo:   HOOK → PROBLEM → SOLUTION → BENEFIT → EXAMPLE → CTA
  - Tutorial:             INTRO → SETUP → STEPS → GOTCHAS → RECAP
  - Interview:            (QUESTION → ANSWER → FOLLOWUP) repeat
  - Travel / event:       ARRIVAL → HIGHLIGHTS → QUIET MOMENTS → DEPARTURE
  - Documentary:          THESIS → EVIDENCE → COUNTERPOINT → CONCLUSION
  - Music / performance:  INTRO → VERSE → CHORUS → BRIDGE → OUTRO
  - Or invent your own.

RULES:
  - Start/end times must fall on word boundaries from the transcript.
  - Pad cut boundaries (working window 30–200ms).
  - Prefer silences ≥ 400ms as cut targets.
  - Unavoidable slips are kept if no better take exists. Note them in "reason".
  - If over budget, revise: drop a beat or trim tails. Report total and self-correct.

OUTPUT (JSON array, no prose):
  [{"source": "C0103", "start": 2.42, "end": 6.85, "beat": "HOOK",
    "quote": "...", "reason": "..."}, ...]

Return the final EDL and a one-line total runtime check.

Color grade (when requested)

Your job is to reason about the image, not apply a preset. Look at a frame (via timeline_view), decide what's wrong, adjust one thing, look again.

Mental model is ASC CDL. Per channel: out = (in * slope + offset) ** power, then global saturation. slope → highlights, offset → shadows, power → midtones.
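
The CDL transform as a numpy sketch (slope/offset/power may be scalars or 3-vectors; saturation pivots on Rec.709 luma):

import numpy as np

def cdl(rgb, slope=1.0, offset=0.0, power=1.0, sat=1.0):
    # rgb: float array in [0, 1], shape (..., 3).
    out = np.clip(rgb * slope + offset, 0.0, None) ** power
    luma = out @ np.array([0.2126, 0.7152, 0.0722])
    return np.clip(luma[..., None] + sat * (out - luma[..., None]), 0.0, 1.0)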

Example filter chains (grade.py has --list-presets; use them as starting points or mix your own):

  • warm_cinematic — retro/technical, subtle teal/orange split, desaturated. Shipped in a real launch video. Safe for talking heads.
  • neutral_punch — minimal corrective: contrast bump + gentle S-curve. No hue shifts.
  • none — straight copy. Default when the user hasn't asked.

For anything else — portraiture, nature, product, music video, documentary — invent your own chain. grade.py --filter '<raw ffmpeg>' accepts any filter string.

Hard rules: apply per-segment during extraction (not post-concat, which re-encodes twice). Never go aggressive without testing skin tones.

Subtitles (when requested)

Subtitles have three dimensions worth reasoning about: chunking (1, 2, or 3 words per line, or full sentences), case (UPPER/Title/Natural), and placement (margin from bottom). The right combo depends on content.

Worked styles — pick, adapt, or invent:

bold-overlay — short-form tech launch, fast-paced social. 2-word chunks, UPPERCASE, break on punctuation, Helvetica 18 Bold, white-on-outline, MarginV=35. render.py ships with this as SUB_FORCE_STYLE.

FontName=Helvetica,FontSize=18,Bold=1,
PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,BackColour=&H00000000,
BorderStyle=1,Outline=2,Shadow=0,
Alignment=2,MarginV=35

natural-sentence (if you invent this mode) — narrative, documentary, education. 4–7 word chunks, sentence case, break on natural pauses, MarginV=60–80, larger font for readability, slightly wider max-width. No shipped force_style — design one if you need it.

Invent a third style if neither fits. Hard rules: subtitles LAST (Rule 1), output-timeline offsets (Rule 5).
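
A sketch of the 2-word chunking for bold-overlay (word fields assumed as in the cut-candidate sketch; cue timestamps still need the Rule 5 offset before writing the SRT):

def two_word_cues(words):
    # Pair words into cues, breaking early on punctuation so a sentence
    # ending never bleeds into the next thought.
    cues, cur = [], []
    for w in words:
        cur.append(w)
        if len(cur) == 2 or w["text"].rstrip().endswith((".", ",", "!", "?")):
            cues.append((cur[0]["start"], cur[-1]["end"],
                         " ".join(x["text"] for x in cur).upper()))
            cur = []
    if cur:
        cues.append((cur[0]["start"], cur[-1]["end"],
                     " ".join(x["text"] for x in cur).upper()))
    return cues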

Animations (when requested)

Animations match the content and the brand. Get the palette, font, and visual language from the conversation — never assume a default. If the user hasn't told you, propose a palette in the strategy phase and wait for confirmation before building anything.

Tool options:

  • PIL + PNG sequence + ffmpeg — simple overlay cards: counters, typewriter text, single bar reveals, progressive draws. Fast to iterate, any aesthetic you want. The launch video used this.
  • Manim — formal diagrams, state machines, equation derivations, graph morphs. Read skills/manim-video/SKILL.md and its references for depth.
  • Remotion — typography-heavy, brand-aligned, web-adjacent layouts. React/CSS-based.

None is mandatory. Invent hybrids if useful (e.g., PIL background with a Remotion layer on top).

Duration rules of thumb, context-dependent:

  • Sync-to-narration explanations. A viewer needs to parse the content at 1×. Rough floor 3s, typical 5–7s for simple cards, 8–14s for complex diagrams. The launch video shipped at 5–7s per simple card.
  • Beat-synced accents (music video, fast montage). 0.5–2s is fine — they're visual accents, not information. The "readable at 1×" rule becomes "recognizable at 1×", not "fully parseable."
  • Hold the final frame ≥ 1s before the cut (universal).
  • Over voiceover: total duration ≥ narration_length + 1s (universal).
  • Never parallel-reveal independent elements — the eye can't track two new things at once. One thing, pause, next thing.

Animation payoff timing (rule for sync-to-narration): get the payoff word's timestamp. Start the overlay reveal_duration seconds earlier so the landing frame coincides with the spoken payoff word. Without this sync the animation feels disconnected.
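
In code (a sketch; the returned dict matches an overlays entry in the EDL format below):

def overlay_window(payoff_ts, reveal_duration, hold=1.0):
    # Start the reveal early enough that the landing frame coincides with
    # the spoken payoff word, then hold the final frame >= 1s before the cut.
    return {"start_in_output": payoff_ts - reveal_duration,
            "duration": reveal_duration + hold}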

Easing (universal — never linear, it looks robotic):

def ease_out_cubic(t):
    return 1 - (1 - t) ** 3           # fast start, slow landing

def ease_in_out_cubic(t):
    if t < 0.5:
        return 4 * t ** 3             # accelerate to the midpoint...
    return 1 - (-2 * t + 2) ** 3 / 2  # ...then decelerate into the landing

ease_out_cubic for single reveals (slow landing). ease_in_out_cubic for continuous draws.

Typing text anchor trick: center on the FULL string's width, not the partial-string width — otherwise text slides left during reveal.
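
A PIL sketch of the trick (font path and palette are the launch-video example values below, not mandates):

from PIL import Image, ImageDraw, ImageFont

def typed_frame(full_text, n_chars, size=(1920, 1080)):
    # Measure the FULL string, fix the left edge, then draw the partial
    # string at that edge: the text grows rightward instead of sliding left.
    img = Image.new("RGB", size, (10, 10, 10))
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("/System/Library/Fonts/Menlo.ttc", 64, index=1)
    x = (size[0] - draw.textlength(full_text, font=font)) / 2
    draw.text((x, size[1] / 2), full_text[:n_chars], font=font, fill=(255, 90, 0))
    return img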

Example palette (the launch video — one aesthetic among infinite):

  • Background (10, 10, 10) near-black
  • Accent #FF5A00 / (255, 90, 0) orange
  • Labels (110, 110, 110) dim gray
  • Font: Menlo Bold at /System/Library/Fonts/Menlo.ttc (index 1)
  • ≤ 2 accent colors, ~40% empty space, minimal chrome
  • Result: terminal / retro tech feel

This is one style. If the brand is warm and serif, use that. If it's colorful and playful, use that. If the user handed you a style guide, follow it. If they didn't, propose one and confirm.

Parallel sub-agent brief — each animation is one sub-agent spawned via the Agent tool. Each prompt is self-contained (sub-agents have no parent context). Include:

  1. One-sentence goal: "Build ONE animation: [spec]. Nothing else."
  2. Absolute output path (<edit>/animations/slot_<id>/render.mp4)
  3. Exact technical spec: resolution, fps, codec, pix_fmt, CRF, duration
  4. Style palette as concrete values (RGB tuples, hex, or reference to a design system)
  5. Font path with index
  6. Frame-by-frame timeline (what happens when, with easing)
  7. Anti-list ("no chrome, no extras, no titles unless specified")
  8. Code pattern reference (copy helpers inline, don't import across slots)
  9. Deliverable checklist (script, render, verify duration via ffprobe, report)
  10. "Do not ask questions. If anything is ambiguous, pick the most obvious interpretation and proceed."

One sub-agent = one file (unique filenames so parallel agents don't overwrite each other).

Output spec

Match the source unless the user asked for something specific. Common targets: 1920×1080@24 cinematic, 1920×1080@30 screen content, 1080×1920@30 vertical social, 3840×2160@24 4K cinema, 1080×1080@30 square. render.py defaults the scale to 1080p from any source; pass --filter or edit the extract command for other targets. Worth asking the user which delivery format matters.

EDL format

{
  "version": 1,
  "sources": {"C0103": "/abs/path/C0103.MP4", "C0108": "/abs/path/C0108.MP4"},
  "ranges": [
    {"source": "C0103", "start": 2.42, "end": 6.85,
     "beat": "HOOK", "quote": "...", "reason": "Cleanest delivery, stops before slip at 38.46."},
    {"source": "C0108", "start": 14.30, "end": 28.90,
     "beat": "SOLUTION", "quote": "...", "reason": "Only take without the false start."}
  ],
  "grade": "warm_cinematic",
  "overlays": [
    {"file": "edit/animations/slot_1/render.mp4", "start_in_output": 0.0, "duration": 5.0}
  ],
  "subtitles": "edit/master.srt",
  "total_duration_s": 87.4
}

grade is a preset name or raw ffmpeg filter. overlays are rendered animation clips. subtitles is optional and applied LAST.

Memory — project.md

Append one section per session at <edit>/project.md:

## Session N — YYYY-MM-DD

**Strategy:** one paragraph describing the approach
**Decisions:** take choices, cuts, grades, animations + why
**Reasoning log:** one-line rationale for non-obvious decisions
**Outstanding:** deferred items

On startup, read project.md if it exists and summarize the last session in one sentence before asking whether to continue.

Anti-patterns

Things that consistently fail regardless of style:

  • Hierarchical pre-computed codec formats with USABILITY / tone tags / shot layers. Over-engineering. Derive from the transcript at decision time.
  • Hand-tuned moment-scoring functions. The LLM picks better than any heuristic you'll write.
  • Whisper SRT / phrase-level output. Loses sub-second gap data. Always word-level verbatim.
  • Running Whisper locally on CPU. Slow and it normalizes fillers. Use hosted Scribe.
  • Burning subtitles into base before compositing overlays. Overlays hide them. (Hard Rule 1.)
  • Single-pass filtergraph when you have overlays. Double re-encodes. Use per-segment extract → concat.
  • Linear animation easing. Looks robotic. Always cubic.
  • Hard audio cuts at segment boundaries. Audible pops. (Hard Rule 3.)
  • Typing text centered on the partial string. Text slides left as it grows.
  • Sequential sub-agents for multiple animations. Always parallel.
  • Editing before confirming the strategy. Never.
  • Re-transcribing cached sources. Immutable outputs of immutable inputs.
  • Assuming what kind of video it is. Look first, ask second, edit last.