# iopho-voiceover-tts

Multi-engine TTS voiceover production for video projects. Three modes: audition → generate → assemble.
## Prerequisites

```shell
# Check tools
python3 --version         # ✓ 3.10+
pip show edge-tts         # ✓ for free TTS (pip install edge-tts if missing)
ffmpeg -version           # ✓ for assembly

# Check API keys (only for paid engines)
echo $ELEVENLABS_API_KEY  # for ElevenLabs
echo $MINIMAX_API_KEY     # for MiniMax
# Edge TTS = FREE, no key needed
```
## Modes

### Mode 1: `audition` — Compare voices before committing
Generate the SAME line with 2-3 different voices so the user can listen and pick.
**Workflow:**

- User provides a sample line (ideally the hero line from the VO script)
- Detect language (or use `--lang`)
- Generate with 2-3 voices from the appropriate engine(s)
- Present files for the user to listen and choose
**Engine routing for audition:**

- English → ElevenLabs (Will, Adam, Antoni) — if API key available
- English (no key) → Edge TTS (JennyNeural, GuyNeural, SoniaNeural)
- Chinese → MiniMax (Gentleman) + Edge TTS (XiaoxiaoNeural, YunxiNeural)
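The routing table above can be sketched as a small helper. This is a hypothetical function (`route_audition` is not one of the shipped scripts); the engine and voice names are taken from the list above.

```python
import os

def route_audition(lang: str, env=os.environ) -> tuple[str, list[str]]:
    """Pick an audition engine + voices from language and available API keys."""
    if lang == "en":
        if env.get("ELEVENLABS_API_KEY"):
            return "elevenlabs", ["Will", "Adam", "Antoni"]
        # No key: fall back to the free engine
        return "edge-tts", ["en-US-JennyNeural", "en-US-GuyNeural", "en-GB-SoniaNeural"]
    if lang == "zh":
        engine = "minimax+edge-tts" if env.get("MINIMAX_API_KEY") else "edge-tts"
        return engine, ["zh-CN-XiaoxiaoNeural", "zh-CN-YunxiNeural"]
    # Unknown language: default to the free engine, no preset shortlist
    return "edge-tts", []
```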
Example:

```shell
# English audition — generates 3 files
python3 scripts/generate-voiceover.py --voice will --scene hook-1a
python3 scripts/generate-voiceover.py --voice adam --scene hook-1a
python3 scripts/generate-voiceover.py --voice antoni --scene hook-1a
```
For new projects, write a minimal script or call the engine API directly:

```python
# ElevenLabs one-shot (voice_id and api_key set elsewhere)
import requests

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
    headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    json={
        "text": "Your sample line here",
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {"stability": 0.50, "similarity_boost": 0.75, "style": 0.35},
    },
)
resp.raise_for_status()
with open("audition-sample.mp3", "wb") as f:
    f.write(resp.content)
```
```python
# Edge TTS one-shot (FREE)
import asyncio
import edge_tts

async def gen():
    tts = edge_tts.Communicate("Your sample line", "en-US-GuyNeural", rate="-5%")
    await tts.save("audition-sample.mp3")

asyncio.run(gen())
```
### Mode 2: `generate` — Produce all VO segments
Generate all voiceover segments for the project from a VO script or cue list.
Input formats accepted:

- VO script markdown — like `{project}/vo-script.md` (scenes with text + timecodes)
- JSON cue file — like `src/i18n/video-strings.json` (`{lang}.subtitles.cues[]`)
- Storyboard — extract VO lines from `.storyboard.md` scene descriptions
- Inline text — direct text for one-off generation
**Workflow:**

- Read the cue source (script, JSON, or storyboard)
- For each segment: detect language → route to engine → generate MP3
- Rate-limit API calls (0.5 s between ElevenLabs calls)
- Report: segment name, duration, file size, cost estimate
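The batch loop above can be sketched as follows. The `detect_lang` and `synthesize` helpers are assumptions (stand-ins for engine routing and the per-engine API calls shown elsewhere in this document), not functions from the shipped scripts.

```python
import time
from pathlib import Path

def generate_all(cues: list[dict], out_dir: str) -> list[Path]:
    """Generate one MP3 per cue into {out_dir}/voiceover/, rate-limited."""
    out = Path(out_dir) / "voiceover"
    out.mkdir(parents=True, exist_ok=True)
    files = []
    for cue in cues:
        path = out / f"{cue['name']}.mp3"
        # lang = detect_lang(cue["text"])      # hypothetical: route to engine
        # synthesize(cue["text"], lang, path)  # hypothetical: engine API call
        files.append(path)
        time.sleep(0.5)  # rate-limit paid API calls between segments
    return files
```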
Output: individual MP3 files in `{output-dir}/voiceover/`, named by segment:

```
voiceover/
├── hook-1a.mp3
├── hook-1b.mp3
├── save-1.mp3
├── read-1.mp3
├── ...
└── remember-2.mp3
```
### Mode 3: `assemble` — Build master VO track
Combine individual segments into a single master track with precise timing.
**The cascade algorithm** (proven in production). For each segment:

```
actual_start = max(previous_segment_end, intended_start)
delay_ms = actual_start / fps * 1000
```

This ensures NO overlap and NO unintended gaps: if a segment runs long, subsequent segments cascade (shift later) rather than overlap.
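A minimal sketch of the cascade, assuming segment durations are already known (e.g. probed with ffprobe). The field names match the positions metadata shown below; the function itself is illustrative, not the shipped `assemble-master-vo.py`.

```python
def cascade(segments: list[dict], fps: int = 30) -> list[dict]:
    """Place segments with no overlap; input sorted by intended_f.

    Each segment dict needs: name, intended_f (frames), duration_s (seconds).
    """
    placed, prev_end_f = [], 0
    for seg in segments:
        actual_f = max(prev_end_f, seg["intended_f"])  # cascade: never overlap
        delay_ms = round(actual_f / fps * 1000)        # adelay value for ffmpeg
        end_f = actual_f + round(seg["duration_s"] * fps)
        placed.append({**seg, "actual_f": actual_f,
                       "delay_ms": delay_ms, "end_f": end_f})
        prev_end_f = end_f
    return placed
```

With the hook-1a values from the metadata example (intended frame 160 at 30 fps, 3.5 s long), this yields a 5333 ms delay and end frame 265; a second segment intended at frame 200 would cascade to frame 265.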
ffmpeg assembly command structure:

```shell
ffmpeg -i seg1.mp3 -i seg2.mp3 ... -i segN.mp3 \
  -filter_complex "
    [0]adelay=D1|D1[a0];
    [1]adelay=D2|D2[a1];
    ...
    [N]adelay=DN|DN[aN];
    [a0][a1]...[aN]amix=inputs=N:normalize=0
  " \
  -codec:a libmp3lame -b:a 128k -ar 44100 master-vo.mp3
```

Key: `normalize=0` prevents amix from reducing the volume of each input.
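One way to build that command programmatically from cascaded segments — a sketch, assuming each segment dict carries `file` and `delay_ms` (as produced by the cascade step); `build_ffmpeg_cmd` is a hypothetical helper, not part of the shipped scripts.

```python
def build_ffmpeg_cmd(segments: list[dict], out: str = "master-vo.mp3") -> list[str]:
    """Assemble the adelay/amix filter graph for N delayed inputs."""
    cmd = ["ffmpeg", "-y"]
    for seg in segments:
        cmd += ["-i", seg["file"]]
    # [i]adelay=D|D[ai] delays both stereo channels of input i by D ms
    parts = [f"[{i}]adelay={s['delay_ms']}|{s['delay_ms']}[a{i}]"
             for i, s in enumerate(segments)]
    mix_inputs = "".join(f"[a{i}]" for i in range(len(segments)))
    graph = (";".join(parts)
             + f";{mix_inputs}amix=inputs={len(segments)}:normalize=0")
    cmd += ["-filter_complex", graph,
            "-codec:a", "libmp3lame", "-b:a", "128k", "-ar", "44100", out]
    return cmd
```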
Output:

- `master-vo.mp3` — single combined track
- `master-vo-positions.json` — timeline metadata for Remotion AudioLayer integration:

```json
[{"name": "hook-1a", "intended_f": 160, "actual_f": 160, "delay_ms": 5333, "duration_s": 3.5, "end_f": 265}]
```
## Existing Scripts Reference

These scripts in `scripts/` are PROVEN in production. Adapt them for new projects:

| Script | Engine | Language | Key Config |
|---|---|---|---|
| `generate-voiceover.py` | ElevenLabs | EN | Will/Adam/Antoni, stability=0.50, style=0.35 |
| `generate-voiceover-zh.py` | ElevenLabs | ZH | Aria, stability=0.55, style=0.20 |
| `generate-voiceover-zh-edge.py` | Edge TTS | ZH | XiaoxiaoNeural, rate=-5% — FREE |
| `generate-voiceover-zh-minimax.py` | MiniMax | ZH | Gentleman, speech-01-hd, 32 kHz |
| `assemble-master-vo.py` | ffmpeg | EN | 10 segments, cascade, 30 fps |
| `assemble-master-vo-zh.py` | ffmpeg | ZH | 12 segments, cascade, 30 fps |
**Adapting for new projects:** the scripts have hardcoded project-specific paths. For a new project:

- Copy the relevant script to your project dir
- Update `VO_DIR`, the `INTENDED` array (scene names + frame offsets), and the output path
- Or use the one-shot API snippets above for simpler projects
## Voice Catalog

See `references/voice-catalog.md` for:

- All voices × engines × languages with quality ratings
- Proven ElevenLabs settings from production
- Cost estimates by video length
- API key setup instructions
## Quality Checklist

- Auditioned 2+ voices before committing
- Language detection correct for each segment
- No clipping or distortion in generated audio
- Segment durations fit within intended timing windows
- Master track has no overlaps (cascade algorithm applied)
- `positions.json` frame offsets match the Remotion composition
- Cost within budget (check voice-catalog.md estimates)
## Related Skills

- `/iopho-audio-director` — uses master-vo.mp3 for ducking + assembly with BGM
- `/iopho-video-director` — calls this skill at Phase 2 (production)
- `/iopho-product-context` — reads context.md for language preferences
- `/remotion-best-practices` — master-vo-positions.json feeds AudioLayer.tsx