dialogue-script-writer
Dialogue Script Writer
Use this skill when the user wants a multi-speaker dialogue script that will be generated with ElevenLabs Text to Dialogue (v3 model). It ensures the script sounds natural and conversational — with proper turn-taking, interruptions, emotional variety, and acoustic cleanliness.
Golden rule: Dialogue is a tennis match, not a monologue. Every line should provoke a reaction, advance the topic, or reveal character.
1. Use When
- User says: "dialogue script", "conversation script", "multi-speaker script"
- User mentions: "Text to Dialogue", "two people talking", "interview script", "podcast dialogue", "dramatic scene", "character conversation"
- Content types: podcasts, interviews, debates, dramatic scenes, tutorials in dialogue format, customer service simulations, audiobook dialogue
- Any script targeting ElevenLabs Text to Dialogue API or similar multi-speaker TTS pipeline
2. Speaker Dynamics
Defining Characters
Before writing, establish each speaker's voice:
| Attribute | Host A | Host B |
|---|---|---|
| Role | Expert / Guide / Skeptic | Learner / Challenger / Optimist |
| Tone | Dry, measured, authoritative | Energetic, curious, reactive |
| Speech Pattern | Short, declarative sentences | Questions, exclamations, interruptions |
| Audio Tag Signature | [deadpan], [thoughtful], [sarcastic] | [excited], [surprised], [laughing] |
For 3+ speakers: Assign each a distinct role and emotional register so the listener can tell them apart even without visual cues.
Chemistry Rules
- No monologues longer than 15 seconds. Cut to another speaker for a reaction, question, or punchline.
- Reactions are content. "Wait, WHAT?" is as valuable as the explanation that provokes it.
- Interruptions create realism. People don't wait for perfect pauses.
- Let speakers disagree. Conflict is more engaging than consensus.
- Shared moments land harder. Both speakers laughing at the same joke beats one person laughing alone.
3. Dialogue Mechanics
Interruptions
Use dashes and audio tags to create natural cut-ins.
Speaker 1: [starting to speak] So the way it works is—
Speaker 2: [jumping in] —wait, is this the part where you explain callbacks?
Speaker 1: [sighs] Unfortunately, yes.
Call and Response
One speaker drops information; the other reacts. The reaction is the beat.
Speaker 1: [deadpan] The database is down.
Speaker 2: [alarmed] AGAIN?!
Speaker 1: [sarcastic] Third time this week. We're calling it "intermittent uptime."
Tag Team Explanation
Split one concept across speakers. Faster and more engaging than a solo lecture.
Speaker 1: A closure is when a function remembers the scope it was created in—
Speaker 2: [excited] —even after that scope has finished executing!
Speaker 1: [impressed] Exactly. You've been paying attention.
Devil's Advocate
One speaker raises an objection so the other can demolish it. The listener learns through conflict.
Speaker 2: [questioning] But do we really need types? JavaScript was fine for years.
Speaker 1: [appalled] Fine? [exhales sharply] You debug undefined at three AM and tell me it was fine.
Trailing Thoughts
Use ellipses to indicate hesitation, trailing off, or unfinished ideas.
Speaker 1: [indecisive] So I was thinking maybe we could... uhhh...
Speaker 2: [quizzically] Use a different framework?
Speaker 1: [relieved] Yes! Thank you.
Overlapping Intent
Simulate people talking over each other without actual audio overlap issues.
Speaker 1: [starting to speak] If we refactor the—
Speaker 2: [overlapping] —the whole module?
Speaker 1: [pause] I was going to say "the config." But sure, let's burn it all down.
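If you assemble scripts programmatically, the mechanics above reduce to a list of turns, each with a speaker, optional audio tags, and text. A minimal sketch (the `Turn` shape and `render_script` helper are illustrative, not part of any ElevenLabs SDK):

```python
# Render structured dialogue turns into "Speaker N: [tag] text" script lines.
from dataclasses import dataclass, field

@dataclass
class Turn:
    speaker: str                              # e.g. "Speaker 1"
    text: str                                 # normalized, TTS-ready text
    tags: list = field(default_factory=list)  # auditory tags, e.g. ["deadpan"]

def render_script(turns):
    lines = []
    for t in turns:
        prefix = "".join(f"[{tag}] " for tag in t.tags)
        lines.append(f"{t.speaker}: {prefix}{t.text}")
    return "\n".join(lines)

script = render_script([
    Turn("Speaker 1", "The database is down.", ["deadpan"]),
    Turn("Speaker 2", "AGAIN?!", ["alarmed"]),
])
print(script)
```

Keeping turns structured until the final render makes it easy to enforce the chemistry rules (turn length, tag density) before any text reaches the API.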
4. Script Structure Templates
A. Interview Format (5-15 min)
- Intro (0:00-0:30): Host introduces guest + topic. One sentence each.
- Background (0:30-2:00): Guest's origin story. Host asks follow-ups.
- Core Topic (2:00-8:00): 3-5 questions. Each answer provokes a reaction.
- Hot Take (8:00-10:00): Devil's advocate question. Lively debate.
- Rapid Fire (10:00-12:00): Short questions, shorter answers.
- Wrap (12:00-15:00): One lesson + where to find guest.
B. Podcast Episode (20-40 min)
- Cold Open (0:00-1:00): Best clip from the episode. Hooks the listener.
- Intro (1:00-2:00): Music + show identity. Host energy sets the tone.
- Segment 1: News / Topic (2:00-10:00): Hosts trade takes. No monologues.
- Segment 2: Deep Dive (10:00-25:00): One concept explored through dialogue.
- Segment 3: Listener Mail / Q&A (25:00-35:00): Hosts read questions aloud, then debate answers.
- Outro (35:00-40:00): Summary + CTA. Specific, not generic.
C. Dramatic Scene (2-5 min)
- Establishment: Setting + relationship. Audio tags set mood.
- Inciting Incident: Something changes. One speaker drops a bombshell.
- Rising Tension: Disagreement escalates. Shorter turns, sharper tags.
- Climax: Emotional peak. [angry], [desperately], [shouting].
- Resolution: One speaker concedes, or they agree to disagree.
- Button: Final line that lands the scene. Often ironic or bittersweet.
D. Tutorial in Dialogue (3-8 min)
- Problem Setup: Host B describes the struggle. Host A listens.
- Concept Introduction: Host A explains. Host B asks "dumb" questions.
- Walkthrough: Step-by-step, alternating speakers.
- Gotcha Moment: Host B tries it wrong. Host A corrects with a joke.
- Wrap: Both confirm understanding. Shared victory.
5. TTS Text Normalization for Dialogue
Every speaker's lines must be normalized BEFORE audio tags are applied. TTS errors in dialogue are especially jarring because they break the conversational flow.
Normalization Table
| Raw Input | Spoken Form |
|---|---|
| $42.50 | forty-two dollars and fifty cents |
| £1,001.32 | one thousand and one pounds and thirty-two pence |
| 1234 | one thousand two hundred thirty-four |
| 3.14 | three point one four |
| 555-555-5555 | five five five, five five five, five five five five |
| 2nd | second |
| XIV | fourteen (or "the fourteenth" if a title) |
| Dr. | Doctor (but saints: "St. Patrick" stays) |
| Ave. | Avenue |
| St. | Street |
| Ctrl + Z | control z |
| 100km | one hundred kilometers |
| 100% | one hundred percent |
| elevenlabs.io/docs | eleven labs dot io slash docs |
| 2024-01-01 | January first, two thousand twenty-four |
| 14:30 | two thirty PM |
| API | A-P-I or "application programming interface" |
| HTML | H-T-M-L |
| npm | N-P-M |
| JSON | J-son or "Jay-sawn" |
Dialogue-Specific Normalization
- Code snippets: Read as spoken descriptions.
  - Bad: const x = useState(0)
  - Good: "const x equals use state zero"
- File paths: Spell out separators. src/components/Button.tsx → "src slash components slash button dot tee-ess-ex"
- URLs in conversation: If hosts joke about a URL, normalize it fully. Nothing kills a punchline like a robot saying "h-t-t-p-s colon slash slash."
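Part of this normalization pass can be mechanized. A minimal sketch covering a few patterns from the table above, using only the standard library; a real pipeline needs many more rules (currency, dates, URLs) and a number-verbalization library to spell digits out as words:

```python
import re

# Hand-picked abbreviation map; "St." is ambiguous (Street vs Saint) and
# needs context-aware handling, so it is deliberately omitted here.
ABBREVIATIONS = {"Dr.": "Doctor", "Ave.": "Avenue"}

def normalize_line(text: str) -> str:
    for raw, spoken in ABBREVIATIONS.items():
        text = text.replace(raw, spoken)
    # Expand units and symbols; digits themselves still need a
    # number-to-words pass (e.g. the num2words package) afterwards.
    text = re.sub(r"(\d+)%", lambda m: f"{m.group(1)} percent", text)
    text = re.sub(r"(\d+)km\b", lambda m: f"{m.group(1)} kilometers", text)
    text = text.replace("Ctrl + Z", "control z")
    return text

print(normalize_line("Dr. Smith pressed Ctrl + Z and CPU usage fell 100%"))
```

Running normalization before tagging keeps the rules simple: they never have to skip over bracketed audio tags.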
6. Audio Tags for Expressive Dialogue
Eleven v3 Text to Dialogue uses square-bracket tags for emotion, delivery, non-speech audio events, and scene context. Tags must describe auditory actions only.
Tag Categories
Emotional Directions:
[happy], [sad], [excited], [angry], [whisper], [annoyed],
[appalled], [thoughtful], [surprised], [sarcastic], [curious],
[mischievously], [professional], [reassuring], [frustrated], [delighted],
[deadpan], [cautiously], [cheerfully], [quizzically], [elated],
[nervously], [alarmed], [sheepishly], [stifling laughter], [cracking up],
[desperately], [panicking], [warmly], [impressed], [dismissive],
[dramatically], [with genuine belly laugh], [robotic voice], [singing]
Non-Verbal Sounds:
[laughs], [laughs harder], [starts laughing], [chuckles], [giggles],
[giggling], [groaning], [sighs], [exhales], [exhales sharply],
[inhales deeply], [clears throat], [short pause], [long pause],
[wheezing], [snorts], [gasps], [muttering], [happy gasp]
Audio Events / Environment:
[leaves rustling], [gentle footsteps], [applause], [clapping],
[gunshot], [explosion], [swallows], [gulps], [record scratch],
[binary beeping]
Overall Direction (scene context):
[football], [wrestling match], [auctioneer], [news broadcast],
[podcast studio], [hacker den], [classroom], [coffee shop]
Tag Placement Rules
- Before the line for mood: [sarcastic] Oh, you thought this was easy?
- After the line for reaction: Another framework. [sighs]
- Inline for mid-sentence shifts: It was working [excited] until it wasn't.
- Interrupt tags for overlapping intent:
  Speaker 1: [starting to speak] So I was thinking—
  Speaker 2: [jumping in] —that we should rewrite everything?
- Scene context tags at the start of a chunk: [podcast studio] or [coffee shop]
- Do NOT use non-auditory tags: [standing], [grinning], [pacing], [music]
- Do NOT turn narrative into tags. If text says "He laughed," add a tag: He laughed [chuckles].
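The "auditory tags only" rule is easy to lint mechanically. A sketch that flags any bracketed tag not found in an allowlist, so visual-only tags like [grinning] never reach the TTS request (the allowlist here is abbreviated; extend it with the full category lists above):

```python
import re

AUDITORY_TAGS = {
    "deadpan", "sarcastic", "excited", "alarmed", "sighs", "chuckles",
    "laughs", "short pause", "long pause", "whisper", "exhales sharply",
}  # partial allowlist -- extend with the full tag categories above

def lint_tags(script: str):
    """Return bracketed tags that are not known auditory tags."""
    bad = []
    for tag in re.findall(r"\[([^\]]+)\]", script):
        if tag not in AUDITORY_TAGS:
            bad.append(tag)
    return bad

print(lint_tags("Speaker 1: [grinning] It works. [sighs]"))  # -> ['grinning']
```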
Tag Density Guidelines
| Content Type | Tags per Speaker | Guidance |
|---|---|---|
| Interview | 4-8 | Professional but expressive |
| Podcast | 6-12 | Higher energy, more reactions |
| Dramatic Scene | 10-20 | Full emotional arc per speaker |
| Tutorial Dialogue | 3-6 | Clear and helpful, occasional humor |
| Customer Simulation | 4-6 | [professional], [sympathetic], [reassuring] |
7. Pacing & Pause Control (v3 Specific)
Eleven v3 does NOT support <break> tags. Use these instead:
| Technique | Effect | Example |
|---|---|---|
| Ellipses | Hesitation, trailing thought | [indecisive] I was thinking maybe... uhhh... |
| Capitalization | Emphasis, stress | It was a VERY long day. |
| Dashes | Interruption, abrupt stop | Wait — you're telling me it works? |
| Commas | Brief rhythmic pause | Okay, so, the way it works is... |
| Periods | Hard stop, new beat | Short sentences. Clear ideas. |
| Audio tags | Breath, pause, non-verbal | [short pause], [exhales], [sighs] |
Pacing Rules for Dialogue
- Average turn length: 3-8 seconds. Never let one speaker hold the floor longer than 10-15 seconds.
- Reaction beats: The other speaker should respond within 1-2 seconds of a punchline or key point.
- Vary turn length: Mix short reactions ("Wait, WHAT?") with longer explanations to create rhythm.
- Silence as a beat: Dead silence after an absurd statement can be comedic, but mark it with [long pause] or a scene tag.
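The turn-length rules above can be enforced with a word-count estimate. A sketch assuming a conversational pace of roughly 150 words per minute (an assumption, not an API value):

```python
# Estimate spoken duration per turn and flag anything over the 15-second cap.
WORDS_PER_SECOND = 150 / 60  # ~2.5 words/sec at conversational pace

def too_long_turns(turns, limit_seconds=15):
    """turns: list of (speaker, text) pairs; returns offenders with estimates."""
    flagged = []
    for speaker, text in turns:
        seconds = len(text.split()) / WORDS_PER_SECOND
        if seconds > limit_seconds:
            flagged.append((speaker, round(seconds, 1)))
    return flagged

turns = [("Speaker 1", "word " * 50), ("Speaker 2", "Wait, WHAT?")]
print(too_long_turns(turns))  # -> [('Speaker 1', 20.0)]
```

Anything flagged is a candidate for a cut-in: split the turn and give the other speaker a reaction beat.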
8. API Constraints & Chunking
ElevenLabs Text to Dialogue has specific limits:
| Constraint | Limit |
|---|---|
| Model | Eleven v3 ONLY |
| Characters per request | Max 2,000 total across ALL speakers |
| Speakers | Unlimited, but 2-3 recommended for clarity |
| Determinism | Nondeterministic — use seed parameter for consistency |
| Free regenerations | 2 per generation (dashboard only, same params) |
Chunking Strategy
If the full script exceeds 2,000 characters:
- Split at natural scene boundaries, topic shifts, or hard cuts.
- End each chunk on a mini-cliffhanger, reaction beat, or emotional peak.
- Use the same seed value if regenerating a chunk for consistency.
- Concatenate audio in post-production with hard cuts.
Quick reference: A 5-minute interview at conversational pace is usually 4-6 chunks.
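The chunking strategy above can be sketched as a simple splitter that breaks only at line (turn) boundaries, never mid-sentence. This is a minimal version; picking the best boundary (scene shift, reaction beat) still needs editorial judgment:

```python
# Split a multi-speaker script into chunks under the 2,000-character
# request limit, breaking only between turns. A single turn longer than
# the limit is kept whole (flag it upstream and rewrite it).
def chunk_script(script: str, limit: int = 2000):
    chunks, current = [], ""
    for line in script.splitlines():
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) > limit and current:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

script = "\n".join(f"Speaker {i % 2 + 1}: line {i}" for i in range(300))
chunks = chunk_script(script)
print(len(chunks), all(len(c) <= 2000 for c in chunks))
```

Because the splitter never drops or reorders lines, joining the chunks with newlines reproduces the original script exactly.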
9. Output Format
Deliver the final script in this structure:
# [Title] — Dialogue Script
## Metadata
- **Content Type**: [Interview / Podcast / Dramatic Scene / Tutorial Dialogue]
- **Target Duration**: [e.g., 5-10 minutes]
- **Model**: Eleven v3 (Text to Dialogue)
- **Speakers**: [count]
- **Chunks**: [1-N]
- **Scene Context**: [podcast studio / coffee shop / classroom / etc.]
## Speaker Definitions
- **Speaker 1 (Alex)**: [Role, tone, tag signature]
- **Speaker 2 (Jordan)**: [Role, tone, tag signature]
- **Speaker 3 (optional)**: [Role, tone, tag signature]
## Dialogue Script
### Chunk 1 (chars: X / 2000)
[Scene context tag if applicable]
Speaker 1: [deadpan] The deployment failed.
Speaker 2: [alarmed] Again?!
Speaker 1: [sarcastic] Third time this week.
...
### Chunk 2 (chars: X / 2000)
Speaker 2: [starting to speak] So what you're saying is—
Speaker 1: [jumping in] —that we should have used Kubernetes?
...
## Pronunciation Notes
- "React": ree-act
- "Kubernetes": koo-ber-net-ees
- [Any ambiguous terms]
## Post-Production Notes
- Compress all voices equally (2:1 ratio, -18dB threshold)
- EQ: slight high-mid boost (+2-3dB at 3-5kHz) for clarity
- Hard cuts between chunks — no crossfade
- Optional: background ambience per scene context
10. Anti-Patterns (NEVER Do)
- One-speaker monologues — If a speaker talks for more than 15 seconds, insert a reaction, question, or interruption.
- "So, welcome to the show" intros — Start with conflict, curiosity, or a provocative statement.
- Leaving raw symbols — $100, API, npm, 2024-01-01 must be normalized.
- No audio tags — Flat dialogue sounds like robots reading a transcript.
- Non-auditory tags — [grinning], [pacing], [music] will be spoken aloud or fail.
- Consensus-only dialogue — Agreement is boring. Let speakers disagree, misunderstand, or one-up each other.
- Ignoring the 2000-char limit — The API will reject or truncate. Chunk proactively.
- Identical voices — If using the API, assign distinct voices to each speaker. Same voice = confusing dialogue.
- Fade-transition language — Script should feel like hard cuts between takes.
- Using <break> tags — v3 does not support SSML break tags. Use ellipses and audio tags only.
11. Quality Checklist
Before delivering the script, verify:
- Hook or conflict in first 3 lines
- All numbers, symbols, dates, abbreviations normalized for speech
- Acronyms spelled out or phonetically guided
- No speaker monologue exceeds 15 seconds uninterrupted
- Audio tags are auditory-only (no [grinning], [music])
- At least 2 interruptions or overlapping moments per chunk
- At least 1 reaction beat per key point
- Turn ratio is balanced — no single speaker dominates
- Tag density appropriate for content type
- Scene context tag used at start of each chunk if applicable
- Script chunked if total characters exceed 2,000
- Each chunk ends on a natural beat (not mid-sentence)
- No <break> tags used (v3 does not support them)
- Pronunciation notes included for ambiguous terms
- Speaker definitions clearly distinguish voice and personality