tts-script-writer
TTS Script Writer
Use this skill when the user wants a solo voiceover script that will be narrated by AI text-to-speech (ElevenLabs, etc.). It ensures the script is acoustically clean (no mispronounced numbers or symbols), expressive (via audio tags and punctuation), and structurally sound for TTS generation.
Golden rule: Write for the ear, not the eye. If it looks fine on paper but sounds wrong when spoken, it's wrong.
1. Use When
- User says: "write a voiceover script", "narration script", "TTS script"
- User mentions: "ElevenLabs script", "AI voice script", "text to speech script"
- Content types: explainer videos, tutorials, audiobooks, podcast segments, e-learning modules, product demos, guided meditations, announcements
- Any single-speaker script fed to a TTS engine
2. Script Structure Templates
A. Explainer / Tutorial (2-5 min)
- Hook (0:00-0:15): One sentence that promises value or provokes curiosity.
- Context (0:15-0:45): Why this matters. One paragraph max.
- Core Concepts (0:45-3:00): 2-4 sections. One idea per section.
  - Each section follows: Concept → Example → Takeaway
- Practical Application (3:00-4:00): Step-by-step or demo narration.
- Wrap / CTA (4:00-5:00): Summary + next step. No generic "thanks for watching."
B. Audiobook / Long-Form Narration (5+ min)
- Scene Setting: Establish mood with descriptive language and audio tags.
- Pacing Variation: Alternate between action (fast, short sentences) and reflection (slower, longer clauses with ellipses).
- Character Voices (if applicable): Use audio tags to shift delivery: `[whispers]` for secrets, `[angry]` for conflict, `[sad]` for loss.
- Chapter Breaks: Use `<break time="2.0s" />` (v2/v2.5) or `[long pause]` (v3) between scenes.
C. Podcast Intro / Promo (30-60s)
- Identity Line (0:00-0:05): Show name + host name. Confident, direct.
- Episode Tease (0:05-0:25): What's in this episode. One compelling fact.
- Value Proposition (0:25-0:45): Why the listener should stay.
- Call to Action (0:45-1:00): Subscribe, follow, or visit. Specific, not generic.
D. Product Demo / Announcement (1-3 min)
- Problem Statement (0:00-0:20): The pain point. Relatable language.
- Solution Reveal (0:20-0:40): Product name + one-sentence value prop.
- Feature Walkthrough (0:40-2:00): 3 features max. Benefits, not specs.
- Proof / Social (2:00-2:30): One testimonial or metric.
- CTA (2:30-3:00): Exact action. "Go to [URL]." (normalized for speech)
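The time ranges in these templates can be turned into rough word budgets per section. The sketch below assumes a narration rate of about 150 words per minute, which is an assumption and not a figure from this skill; real pace depends on the voice, the model, and any pauses. The names `to_seconds` and `word_budget` are hypothetical.

```python
# Assumed average narration speed; tune it after a test render of your voice.
WORDS_PER_MINUTE = 150

def to_seconds(timestamp: str) -> int:
    """Convert an "M:SS" timestamp such as "0:45" into seconds."""
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)

def word_budget(start: str, end: str, wpm: int = WORDS_PER_MINUTE) -> int:
    """Approximate number of words that fit between two timestamps (rounded down)."""
    duration = to_seconds(end) - to_seconds(start)
    return duration * wpm // 60

# Podcast intro template (C) expressed as timestamp pairs.
podcast_intro = {
    "Identity Line": ("0:00", "0:05"),
    "Episode Tease": ("0:05", "0:25"),
    "Value Proposition": ("0:25", "0:45"),
    "Call to Action": ("0:45", "1:00"),
}

budgets = {name: word_budget(*span) for name, span in podcast_intro.items()}
```

Treat the budgets as a drafting target only. Audio tags and pauses add time without adding words.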
3. TTS Text Normalization (Non-Negotiable)
TTS models mispronounce numbers, symbols, dates, and abbreviations. Normalize ALL of these in the script before adding audio tags.
Normalization Table
| Raw Input | Spoken Form | Example / Notes |
|---|---|---|
| `$42.50` | forty-two dollars and fifty cents | `$99.99` → ninety-nine dollars and ninety-nine cents |
| `£1,001.32` | one thousand and one pounds and thirty-two pence | |
| `€100` | one hundred euros | |
| `¥1000` | one thousand yen | |
| `1234` | one thousand two hundred thirty-four | Expand all bare numbers > 20 |
| `3.14` | three point one four | |
| `555-555-5555` | five five five, five five five, five five five five | Phone numbers digit-by-digit |
| `2nd` | second | All ordinals |
| `XIV` | fourteen | Roman numerals ("the fourteenth" if a title) |
| `⅔` | two-thirds | |
| `Dr.` | Doctor | Expand abbreviations |
| `Ave.` | Avenue | |
| `St.` | Street | But saints: "St. Patrick" stays |
| `Ctrl + Z` | control z | Keyboard shortcuts |
| `100km` | one hundred kilometers | Unit abbreviations |
| `100%` | one hundred percent | Percentages |
| `elevenlabs.io/docs` | eleven labs dot io slash docs | URLs: spell out separators |
| `2024-01-01` | January first, two-thousand twenty-four | Dates |
| `14:30` | two thirty PM | Times |
| `01/02/2023` | January second, two-thousand twenty-three | Pick locale-appropriate form |
| `API` | A-P-I or "application programming interface" | Acronyms: spell out if uncommon |
| `HTML` | H-T-M-L or "hypertext markup language" | |
| `npm` | N-P-M | Package managers as letters |
| `JSON` | J-son or "Jay-sawn" | Choose the pronunciation you want |
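A few rows of this table can be automated before a manual pass. The following is a minimal sketch, not a complete normalizer: it covers dollar amounts, whole-number percentages, simple decimals, and bare integers below one million. The helper names (`number_to_words`, `normalize`) are hypothetical, and dates, phone numbers, ordinals, and other currencies would still need their own rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999,999 (no "and", as in the table)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")

def _dollars(match: re.Match) -> str:
    """Spell out a matched dollar amount, with cents when present."""
    dollars = number_to_words(int(match.group(1)))
    cents = match.group(2)
    if cents is None:
        return f"{dollars} dollars"
    return f"{dollars} dollars and {number_to_words(int(cents))} cents"

def _decimal(match: re.Match) -> str:
    """Read the integer part as a number and the fraction digit by digit."""
    digits = " ".join(ONES[int(d)] for d in match.group(2))
    return f"{number_to_words(int(match.group(1)))} point {digits}"

def normalize(text: str) -> str:
    """Apply the rules in order: most specific patterns first, bare integers last."""
    text = re.sub(r"\$(\d+)(?:\.(\d{2}))?", _dollars, text)
    text = re.sub(r"(\d+)%", lambda m: number_to_words(int(m.group(1))) + " percent", text)
    text = re.sub(r"(\d+)\.(\d+)", _decimal, text)
    text = re.sub(r"\b\d+\b", lambda m: number_to_words(int(m.group())), text)
    return text
```

Rule order matters: running the bare-integer rule first would break `$42.50` into two separate numbers. Read the output aloud anyway, since edge cases such as `$1` ("one dollars") are not handled here.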
Code & Technical Content
- Code snippets: Read as spoken descriptions, not literal syntax.
  - Bad: `const x = useState(0)`
  - Good: "const x equals use state zero"
- File paths: Spell separators. `src/components/Button.tsx` → "src slash components slash button dot tee-ess-ex"
- Git commands: `git commit -m "fix"` → "git commit dash m fix"
- Regex: `/^[a-z]+$/i` → "slash caret a through z plus dollar slash i"
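The file-path rule is mechanical enough to script. This is a small sketch under simple assumptions: `speak_path` is a hypothetical helper, the separator words follow the examples above, and the only extension respelled is `tsx`. Add other extensions or names as your voice requires.

```python
import re

# Spoken forms for separators, plus extensions that need letter-by-letter spelling.
SPOKEN = {
    "/": "slash",
    ".": "dot",
    "-": "dash",
    "_": "underscore",
    "tsx": "tee-ess-ex",
}

def speak_path(path: str) -> str:
    """Split a path into name parts and separators, then join the spoken forms."""
    tokens = re.findall(r"[A-Za-z0-9]+|[/._-]", path)
    return " ".join(SPOKEN.get(token, token.lower()) for token in tokens)

spoken = speak_path("src/components/Button.tsx")
```

Characters outside letters, digits, and the four separators are dropped, so check paths that contain spaces or other symbols by hand.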
4. Pronunciation Control
Phonetic Spelling
If a word is consistently mispronounced by your chosen voice, respell it phonetically in the script.
- Example: write "trapezii" as "trapezIi" to stress the "ii"
- Example: "Kubernetes" → "koo-ber-net-ees" if the voice struggles
Capitalization for Emphasis (v3)
Capital letters increase emphasis in Eleven v3:
- "It was a VERY long day."
- "This is NOT a drill."
Phoneme Tags (Flash v2 / English v1 only)
For precise pronunciation of specific words, use SSML phoneme tags:
<phoneme alphabet="cmu-arpabet" ph="P R AH0 N AH0 N S IY0 EY1 SH AH0 N">
pronunciation
</phoneme>
Note: Phoneme tags only work with Eleven Flash v2 and Eleven English v1. Multilingual v2 and v3 do NOT support phoneme tags.
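If phoneme tags are generated programmatically, the compatibility note can be enforced in code. A minimal sketch: `phoneme_tag` is a hypothetical helper, the model names are the labels used in this skill rather than API identifiers, and the transcription is illustrative.

```python
from xml.sax.saxutils import escape, quoteattr

# Models this skill lists as supporting phoneme tags.
PHONEME_MODELS = {"Flash v2", "English v1"}

def phoneme_tag(word: str, arpabet: str, model: str) -> str:
    """Wrap a word in a CMU Arpabet phoneme tag.

    Returns the plain word when the model is not listed as phoneme-capable,
    so the tag is never spoken aloud or passed to a model that ignores it.
    """
    if model not in PHONEME_MODELS:
        return word
    return f'<phoneme alphabet="cmu-arpabet" ph={quoteattr(arpabet)}>{escape(word)}</phoneme>'

# Illustrative transcription; verify stress markers against a pronunciation dictionary.
tagged = phoneme_tag("pronunciation", "P R AH0 N AH0 N S IY0 EY1 SH AH0 N", "Flash v2")
plain = phoneme_tag("pronunciation", "P R AH0 N AH0 N S IY0 EY1 SH AH0 N", "Eleven v3")
```

For models outside the set, fall back to phonetic respelling in the script text.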
5. Pause & Pacing Control
For Eleven v2 / v2.5 / Flash v2 Models
Use <break time="x.xs" /> for natural pauses up to 3 seconds.
"Hold on, let me think." <break time="1.5s" /> "Alright, I've got it."
Caution: Too many break tags in one generation cause instability (speedups, artifacts). Use 1-2 per short script, 2-3 per long script. Prefer punctuation pauses instead.
For Eleven v3 Models
v3 does NOT support <break>. Use:
- Ellipses for hesitation: `It was... a mistake.`
- Capitalization for emphasis: `It was a VERY long day.`
- Punctuation for rhythm: commas, periods, dashes
- Audio tags for breath/pause: `[short pause]`, `[long pause]`, `[exhales]`, `[sighs]`
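When a script written for a v2 model is reused with v3, the break tags have to be swapped for pause tags. The sketch below does that with a regular expression. `breaks_to_v3` is a hypothetical helper, and the 1.0-second cutoff between `[short pause]` and `[long pause]` is an assumption, not a documented mapping.

```python
import re

# Matches tags such as <break time="1.5s" /> or <break time="2s"/>.
BREAK_TAG = re.compile(r'<break\s+time="(\d+(?:\.\d+)?)s"\s*/>')

def breaks_to_v3(script: str, long_threshold: float = 1.0) -> str:
    """Replace SSML break tags with v3 pause tags.

    Breaks at or above long_threshold seconds become [long pause];
    shorter ones become [short pause]. The threshold is an assumed cutoff.
    """
    def swap(match: re.Match) -> str:
        seconds = float(match.group(1))
        return "[long pause]" if seconds >= long_threshold else "[short pause]"

    return BREAK_TAG.sub(swap, script)

v2_line = '"Hold on, let me think." <break time="1.5s" /> "Alright, I\'ve got it."'
v3_line = breaks_to_v3(v2_line)
```

Listen to the result before shipping it: pause tags do not guarantee an exact duration the way a timed break does.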
General Pacing Guidelines
- Short sentences are more intelligible in TTS than complex compound sentences.
- One idea per sentence — especially important for technical content.
- Paragraph breaks create natural breathing room. Don't cram everything into one block.
- Vary sentence length to avoid robotic cadence. Short. Then a bit longer. Then short again.
6. Audio Tags for Expressive Delivery
For Eleven v3 (or tag-aware TTS models), inject audio tags to control emotion and non-verbal delivery. Tags must describe auditory actions only.
Tag Categories
Emotional Directions:
[happy], [sad], [excited], [angry], [whisper], [annoyed],
[appalled], [thoughtful], [surprised], [sarcastic], [curious],
[mischievously], [professional], [reassuring], [frustrated], [delighted],
[nervously], [cautiously], [cheerfully], [quizzically], [elated],
[deadpan], [dramatically], [dismissive], [impressed], [warmly]
Non-Verbal Sounds:
[laughs], [laughs harder], [starts laughing], [chuckles], [giggles],
[giggling], [groaning], [sighs], [exhales], [exhales sharply],
[inhales deeply], [clears throat], [short pause], [long pause],
[wheezing], [snorts], [gasps], [muttering], [happy gasp]
Sound Effects (use sparingly):
[gunshot], [applause], [clapping], [explosion], [swallows], [gulps],
[record scratch], [binary beeping]
Overall Direction (scene context):
[football], [wrestling match], [auctioneer], [news broadcast],
[podcast studio], [hacker den], [library], [classroom]
Tag Placement Rules
- Before the line for global mood: `[sarcastic] Oh, you thought this was easy?`
- After the line for reaction: `Another framework. [sighs]`
- Inline for mid-sentence shifts: `It was working [excited] until it wasn't. [sighs]`
- At the start of a scene for ambient context: `[podcast studio]` or `[library]`
- Do NOT replace narrative descriptions with tags. If the text says "He laughed loudly," keep the sentence and add a tag: `He laughed loudly [chuckles].`
- Do NOT use non-auditory tags: `[standing]`, `[grinning]`, `[pacing]`, `[music]`
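The last rule is easy to check automatically with a denylist. A minimal sketch: `find_non_auditory_tags` is a hypothetical helper, and the denylist holds only the four examples named above, so extend it for your own scripts.

```python
import re

# Non-auditory examples named in the placement rules; extend as needed.
NON_AUDITORY = {"standing", "grinning", "pacing", "music"}

def find_non_auditory_tags(script: str) -> list[str]:
    """Return every bracketed tag whose text is on the non-auditory denylist."""
    tags = re.findall(r"\[([^\[\]]+)\]", script)
    return [tag for tag in tags if tag.lower() in NON_AUDITORY]

flagged = find_non_auditory_tags("[grinning] Hello there. [sighs] Fine. [Music]")
```

A denylist only catches known offenders. An unfamiliar tag that passes this check still needs a listen.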
Tag Density Guidelines
| Content Type | Tag Count | Guidance |
|---|---|---|
| Explainer / Tutorial | 4-8 tags | One tag per major section or emotional beat |
| Audiobook / Story | 10-20 tags | Higher density for character emotion and scene shifts |
| Podcast Intro | 2-4 tags | Confidence and energy tags for the hook |
| Product Demo | 3-6 tags | Enthusiasm for features, professionalism for specs |
| Meditation / Calm | 4-6 tags | Soft tags: [whisper], [softly], [gently] |
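The ranges in this table can be checked with a simple counter. In this sketch, `check_tag_density` is a hypothetical helper and any bracketed span counts as a tag, so placeholders such as `[URL]` would be counted too.

```python
import re

# Tag count ranges copied from the density table.
TAG_RANGES = {
    "Explainer / Tutorial": (4, 8),
    "Audiobook / Story": (10, 20),
    "Podcast Intro": (2, 4),
    "Product Demo": (3, 6),
    "Meditation / Calm": (4, 6),
}

def check_tag_density(script: str, content_type: str) -> tuple[int, str]:
    """Count bracketed tags and compare the total to the content type's range."""
    count = len(re.findall(r"\[[^\[\]]+\]", script))
    low, high = TAG_RANGES[content_type]
    if count < low:
        verdict = "too few"
    elif count > high:
        verdict = "too many"
    else:
        verdict = "ok"
    return count, verdict

result = check_tag_density("[excited] Welcome back. [chuckles] Let's go.", "Podcast Intro")
```

The count is a guide, not a target. A long audiobook chapter may justify more tags than a short one.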
7. Model Selection Quick Guide
| Model | Best For | Break Tags | Audio Tags | Phoneme Tags |
|---|---|---|---|---|
| Eleven v3 | Expressive narration, character voices | No | Yes | No |
| Multilingual v2 | Natural speech, multiple languages | Yes | No | No |
| Flash v2.5 | Low latency, real-time | No | No | No |
| Flash v2 | Fast generation, English | Yes | No | Yes |
| English v1 | Legacy English content | Yes | No | Yes |
Recommendation: Use v3 for character-driven or expressive content. Use Multilingual v2 for natural, neutral narration in any language.
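The capability table can double as a pre-flight check on a finished script. The sketch below copies the table as written into a lookup. `unsupported_markup` is a hypothetical helper, and the model names are the labels used in this guide, not API model IDs.

```python
import re
from typing import NamedTuple

class Capabilities(NamedTuple):
    break_tags: bool
    audio_tags: bool
    phoneme_tags: bool

# Values copied from the quick guide table.
MODELS = {
    "Eleven v3": Capabilities(break_tags=False, audio_tags=True, phoneme_tags=False),
    "Multilingual v2": Capabilities(break_tags=True, audio_tags=False, phoneme_tags=False),
    "Flash v2.5": Capabilities(break_tags=False, audio_tags=False, phoneme_tags=False),
    "Flash v2": Capabilities(break_tags=True, audio_tags=False, phoneme_tags=True),
    "English v1": Capabilities(break_tags=True, audio_tags=False, phoneme_tags=True),
}

def unsupported_markup(script: str, model: str) -> list[str]:
    """List the markup kinds found in the script that the model does not support."""
    caps = MODELS[model]
    problems = []
    if "<break" in script and not caps.break_tags:
        problems.append("break tags")
    if re.search(r"\[[^\[\]]+\]", script) and not caps.audio_tags:
        problems.append("audio tags")
    if "<phoneme" in script and not caps.phoneme_tags:
        problems.append("phoneme tags")
    return problems

issues = unsupported_markup('Wait. <break time="1.0s" /> Go.', "Eleven v3")
```

An empty list means the markup matches the table, not that the script will sound right.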
8. Output Format
Deliver the final script in this structure:
# [Title] — TTS Script
## Metadata
- **Content Type**: [Explainer / Audiobook / Podcast / Demo / etc.]
- **Target Duration**: [e.g., 3-5 minutes]
- **TTS Model**: [Eleven v3 / Multilingual v2 / Flash v2.5]
- **Voice**: [Voice name or description]
- **Audio Tags**: [count]
## Normalized Voiceover Script
[Paste the fully normalized, tag-enhanced script here.]
[Use paragraph breaks for natural breathing room.]
## Pronunciation Notes
- "React": ree-act (not ray-act)
- "Kubernetes": koo-ber-net-ees
- [Any other words that need explicit direction]
## Post-Production Notes (optional)
- Speed adjustment: [1.0x / 1.15x / 1.25x]
- Background music: [genre / tempo / volume relative to voice]
9. Anti-Patterns (NEVER Do)
- Leaving raw symbols in the script: `$100`, `API`, `2024-01-01`, `Ctrl+Z` must be normalized.
- Long run-on sentences: TTS drifts on complex clauses. Break them up.
- No audio tags on v3: Untagged text sounds flat and robotic on expressive models.
- Non-auditory tags: `[grinning]`, `[pacing]`, `[music]` will be spoken or ignored.
- Too many `<break>` tags: Causes instability (speedups, artifacts). Max 2-3.
- Using `<break>` with v3: v3 does not support SSML break tags.
- Monotone pacing: Avoid uniform sentence length. Vary short and long.
- Generic CTAs: "Thanks for watching" and "like and subscribe" are lazy. Write topic-specific closings.
- Ignoring model capabilities: Don't use phoneme tags with v3 or Multilingual v2. They won't work.
- Writing for the eye: Read the script aloud (or imagine it spoken) before delivering. If it sounds awkward, rewrite it.
10. Quality Checklist
Before delivering the script, verify:
- All numbers, symbols, dates, abbreviations normalized for speech
- Acronyms spelled out or phonetically guided
- Audio tags are auditory-only (no `[grinning]`, `[music]`)
- Audio tags placed strategically before / after / inline
- Tag density appropriate for content type
- Sentences vary in length (not all short, not all long)
- Paragraph breaks create natural breathing room
- No raw code syntax — all technical content spoken-out
- Pronunciation notes included for ambiguous terms
- Model-appropriate pause control (`<break>` for v2, ellipses/tags for v3)
- Phoneme tags only used with compatible models (Flash v2, English v1)
- CTA or closing is specific to the topic (not generic)
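The first checklist item can be partly automated with a lint pass for leftover raw text. This is a rough sketch: `lint_script` is a hypothetical helper, and the patterns are a small assumed starting set that will miss many cases, such as abbreviations and acronyms.

```python
import re

# Patterns for text that usually still needs normalization (assumed starter set).
RAW_PATTERNS = {
    "digits": r"\d",
    "currency symbol": r"[$£€¥]",
    "percent sign": r"%",
    "url-like text": r"\b\w+\.(?:com|io|org|net)\b",
}

def lint_script(script: str) -> list[str]:
    """Return labels for raw patterns found in the spoken text.

    SSML tags are removed first so values like time="1.5s" are not flagged.
    """
    spoken = re.sub(r"<[^>]+>", "", script)
    return [label for label, pattern in RAW_PATTERNS.items() if re.search(pattern, spoken)]

warnings = lint_script("It costs $100 at elevenlabs.io.")
```

An empty result does not replace reading the script aloud. It only catches the most obvious leftovers.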