# HyperFrames Media Preprocessing
Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`. Drop the output into the project, then reference it from the composition HTML — see the hyperframes skill for the audio/video element conventions.
## Text-to-Speech (`tts`)
Generate speech audio locally with Kokoro-82M. No API key.
npx hyperframes tts "Text here" --voice af_nova --output narration.wav
npx hyperframes tts script.txt --voice bf_emma --output narration.wav
npx hyperframes tts --list # all 54 voices
### Voice Selection

Match voice to content. Default is `af_heart`.
| Content type | Voice | Why |
|---|---|---|
| Product demo | `af_heart` / `af_nova` | Warm, professional |
| Tutorial / how-to | `am_adam` / `bf_emma` | Neutral, easy to follow |
| Marketing / promo | `af_sky` / `am_michael` | Energetic or authoritative |
| Documentation | `bf_emma` / `bm_george` | Clear British English, formal |
| Casual / social | `af_heart` / `af_sky` | Approachable, natural |
### Multilingual

Voice IDs encode language in the first letter: a=American English, b=British English, e=Spanish, f=French, h=Hindi, i=Italian, j=Japanese, p=Brazilian Portuguese, z=Mandarin. The CLI auto-detects the phonemizer locale from the prefix — no `--lang` needed when the voice matches the text.
npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
npx hyperframes tts "今日はいい天気ですね" --voice jf_alpha --output ja.wav
Use `--lang` only to override auto-detection (stylized accents). Valid codes: `en-us`, `en-gb`, `es`, `fr-fr`, `hi`, `it`, `pt-br`, `ja`, `zh`. Non-English phonemization requires espeak-ng installed system-wide (`brew install espeak-ng` / `apt-get install espeak-ng`).
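One override case from the list above: forcing British phonemization onto an American voice for a stylized read. The flag and the `en-gb` code are documented above; the result is worth auditioning per voice:

```sh
# af_heart is American English; --lang en-gb overrides the auto-detected
# en-us phonemizer for a British-leaning pronunciation.
npx hyperframes tts "Schedule the demo for Tuesday" --voice af_heart --lang en-gb --output accent.wav
```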
### Speed

- `0.7-0.8` — tutorial, complex content, accessibility
- `1.0` — natural pace (default)
- `1.1-1.2` — intros, transitions, upbeat content
- `1.5+` — rarely appropriate; test carefully
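The list above gives multiplier values, but the command reference never names the flag. A sketch assuming the conventional `--speed` flag name (an assumption; verify against the CLI's help before relying on it):

```sh
# ASSUMPTION: --speed is a guessed flag name; the section above documents
# the multiplier values but not the flag that sets them.
npx hyperframes tts tutorial.txt --voice am_adam --speed 0.8 --output slow.wav
```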
### Long Scripts

For more than a few paragraphs, write the script to a `.txt` file and pass the path. Inputs over ~5 minutes of speech may benefit from splitting into segments.
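A minimal sketch of the segment approach, assuming the script has already been split into per-section `.txt` files (file names illustrative):

```sh
# Generate one wav per segment; place each on its own audio clip in the
# composition, or concatenate them before import.
npx hyperframes tts intro.txt --voice af_heart --output narration-01.wav
npx hyperframes tts body.txt  --voice af_heart --output narration-02.wav
npx hyperframes tts outro.txt --voice af_heart --output narration-03.wav
```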
### Requirements

Python 3.8+ with kokoro-onnx and soundfile (`pip install kokoro-onnx soundfile`). Model downloads on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).
## Transcription (`transcribe`)

Produce a normalized `transcript.json` with word-level timestamps.
```sh
npx hyperframes transcribe audio.mp3
npx hyperframes transcribe video.mp4 --model small --language es
npx hyperframes transcribe subtitles.srt    # import existing
npx hyperframes transcribe subtitles.vtt
npx hyperframes transcribe openai-response.json
```
### Language Rule (Non-Negotiable)

Never use `.en` models unless the user explicitly states the audio is English. `.en` models (`small.en`, `medium.en`) translate non-English audio into English instead of transcribing it. This silently destroys the original language.
- Language known and non-English → `--model small --language <code>` (no `.en` suffix)
- Language known and English → `--model small.en`
- Language unknown → `--model small` (no `.en`, no `--language`) — whisper auto-detects

Default model is `small`, not `small.en`.
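The same three cases as concrete commands (file names are illustrative):

```sh
# Known non-English (Spanish): multilingual model + explicit language
npx hyperframes transcribe interview.mp3 --model small --language es

# Explicitly confirmed English: the .en variant is allowed
npx hyperframes transcribe podcast.mp3 --model small.en

# Unknown language: multilingual model, let whisper auto-detect
npx hyperframes transcribe clip.mp3 --model small
```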
### Model Sizes

| Model | Size | Speed | When to use |
|---|---|---|---|
| `tiny` | 75 MB | Fastest | Quick previews, testing pipeline |
| `base` | 142 MB | Fast | Short clips, clear audio |
| `small` | 466 MB | Moderate | Default — most content |
| `medium` | 1.5 GB | Slow | Important content, noisy audio, music |
| `large-v3` | 3.1 GB | Slowest | Production quality |
Music with vocals: start at `medium` minimum; produced tracks often need manual SRT/VTT import. For caption-quality checks (mandatory after every transcription), the cleaning JS, retry rules, and the OpenAI/Groq API import path, see `hyperframes/references/transcript-guide.md`.
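A sketch of the music path under those constraints, with illustrative file names: try `medium` first, then fall back to the documented SRT import with hand-corrected timings:

```sh
# First pass: medium model for vocals over a produced mix
npx hyperframes transcribe track.mp3 --model medium

# If word timing is still off, correct an SRT by hand and import it
npx hyperframes transcribe track-corrected.srt
```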
### Output Shape

Compositions consume a flat array of word objects. The `id` field (`w0`, `w1`, ...) is added during normalization for stable references in caption overrides; it's optional for backwards compatibility.
```json
[
  { "id": "w0", "text": "Hello", "start": 0.0, "end": 0.5 },
  { "id": "w1", "text": "world.", "start": 0.6, "end": 1.2 }
]
```
## Background Removal (`remove-background`)
Remove the background from a video or image so the subject (typically a person — avatar, presenter, talking head) sits as a transparent overlay in a composition.
```sh
npx hyperframes remove-background subject.mp4 -o transparent.webm    # default: VP9 alpha WebM
npx hyperframes remove-background subject.mp4 -o transparent.mov     # ProRes 4444 (editing)
npx hyperframes remove-background portrait.jpg -o cutout.png         # single-image cutout
npx hyperframes remove-background subject.mp4 -o transparent.webm --device cpu
npx hyperframes remove-background --info    # detected providers
```
Uses `u2net_human_seg` (MIT). First run downloads ~168 MB of weights to `~/.cache/hyperframes/background-removal/models/`.
### Output Format

| Format | When |
|---|---|
| `.webm` (VP9 + alpha) | Default. Compositions play this directly via `<video>`. |
| `.mov` (ProRes 4444) | Editing in DaVinci/Premiere/FCP. Large files. |
| `.png` | Single-image cutout (still subject, layered over a backdrop). |
Chrome decodes VP9 alpha natively, so the `.webm` plugs into a composition like any other muted-autoplay video — see the hyperframes skill for the `<video>` track conventions.
### Quality Presets

`--quality fast|balanced|best` controls only the VP9 encoder's CRF — segmentation quality is fixed.

| Preset | CRF | When |
|---|---|---|
| `fast` | 30 | Iterating, smaller file, looser color match |
| `balanced` | 18 | Default. Visually identical for most uses |
| `best` | 12 | Master / final delivery. Largest file, tightest match |
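The preset rides on the same command shape as the examples above (file names illustrative):

```sh
# Fast iteration while framing the shot
npx hyperframes remove-background subject.mp4 -o preview.webm --quality fast

# Final delivery, especially for text-behind-subject composites (see below)
npx hyperframes remove-background subject.mp4 -o master.webm --quality best
```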
### Compositing patterns — pick the right one

The cutout `.webm` is a re-encoded copy of the source `.mp4`'s RGB. That choice has consequences depending on what you put behind it:
| Pattern | What's behind the cutout | Result |
|---|---|---|
| Cutout over a different scene (most common) | Static image, gradient, or unrelated video | Looks great. The cutout's RGB is the only source of the subject — no doubling, no edge halo. This is what remove-background is built for. |
| Cutout over its own source mp4 (text-behind-subject) | Same mp4 the cutout was generated from | Two RGB sources for the same person. At default `--quality balanced` (CRF 18) the doubling is barely visible; at `--quality fast` (CRF 30) you'll see a faint color shift / edge halo. Use `--quality best` (CRF 12) for masters. |
| Cutout over a different take of the same person | Footage of the same subject | Will look like two separate people overlapping. Don't do this. |
Text-behind-subject (headline behind a presenter):
```html
<video
  src="presenter.mp4"
  id="bg"
  data-start="0"
  data-media-start="0"
  data-duration="6"
  data-track-index="0"
  muted
  playsinline
></video>
<h1 id="headline" style="z-index:2; ...">MAKE IT IN HYPERFRAMES</h1>
<div class="cutout-wrap" style="position:absolute;inset:0;z-index:3;opacity:0">
  <video
    src="presenter.webm"
    data-start="0"
    data-media-start="0"
    data-duration="6"
    data-track-index="1"
    muted
    playsinline
  ></video>
</div>
```
Two key rules:
- Wrap the cutout video in a non-timed `<div>` and animate the wrapper's opacity, not the video element's. The framework forces `opacity: 1` on active clips (any element with `data-start`/`data-duration`), so animating the video's opacity directly is silently overridden. The wrapper has no `data-*` attributes, so it's owned by your CSS/GSAP.
- Both videos use `data-start="0"` and `data-media-start="0"` so the framework decodes them in sync from t=0. Late-mounting the cutout (`data-start="3.3"`) introduces a seek + warm-up that lands a frame off the base mp4 — visible as one frame of misalignment at the cut.
Then GSAP-flip the wrapper opacity at the cut: `tl.set(cutoutWrap, { opacity: 1 }, 3.3)`.
## TTS → Transcribe → Captions
When there's no pre-recorded voiceover, generate one and transcribe it back to get word-level timestamps for captions:
```sh
npx hyperframes tts script.txt --voice af_heart --output narration.wav
npx hyperframes transcribe narration.wav    # → transcript.json
```
Whisper extracts precise word boundaries from the generated audio, so caption timing matches delivery without hand-tuning.