dialogue-script-writer

Dialogue Script Writer

Use this skill when the user wants a multi-speaker dialogue script that will be generated with ElevenLabs Text to Dialogue (v3 model). It ensures the script sounds natural and conversational — with proper turn-taking, interruptions, emotional variety, and acoustic cleanliness.

Golden rule: Dialogue is a tennis match, not a monologue rally. Every line should provoke a reaction, advance the topic, or reveal character.

1. Use When

User says: "dialogue script", "conversation script", "multi-speaker script"
User mentions: "Text to Dialogue", "two people talking", "interview script", "podcast dialogue", "dramatic scene", "character conversation"
Content types: podcasts, interviews, debates, dramatic scenes, tutorials in dialogue format, customer service simulations, audiobook dialogue
Any script targeting ElevenLabs Text to Dialogue API or similar multi-speaker TTS pipeline

2. Speaker Dynamics

Defining Characters

Before writing, establish each speaker's voice:

Attribute	Host A	Host B
Role	Expert / Guide / Skeptic	Learner / Challenger / Optimist
Tone	Dry, measured, authoritative	Energetic, curious, reactive
Speech Pattern	Short, declarative sentences	Questions, exclamations, interruptions
Audio Tag Signature	`[deadpan]`, `[thoughtful]`, `[sarcastic]`	`[excited]`, `[surprised]`, `[laughing]`

For 3+ speakers: Assign each a distinct role and emotional register so the listener can tell them apart even without visual cues.

Chemistry Rules

No monologues longer than 15 seconds. Cut to another speaker for a reaction, question, or punchline.
Reactions are content. "Wait, WHAT?" is as valuable as the explanation that provokes it.
Interruptions create realism. People don't wait for perfect pauses.
Let speakers disagree. Conflict is more engaging than consensus.
Shared moments land harder. Both speakers laughing at the same joke works better than one person laughing alone.

3. Dialogue Mechanics

Interruptions

Use dashes and audio tags to create natural cut-ins.

Speaker 1: [starting to speak] So the way it works is—
Speaker 2: [jumping in] —wait, is this the part where you explain callbacks?
Speaker 1: [sighs] Unfortunately, yes.

Call and Response

One speaker drops information; the other reacts. The reaction is the beat.

Speaker 1: [deadpan] The database is down.
Speaker 2: [alarmed] AGAIN?!
Speaker 1: [sarcastic] Third time this week. We're calling it "intermittent uptime."

Tag Team Explanation

Split one concept across speakers. Faster and more engaging than a solo lecture.

Speaker 1: A closure is when a function remembers the scope it was created in—
Speaker 2: [excited] —even after that scope has finished executing!
Speaker 1: [impressed] Exactly. You've been paying attention.

Devil's Advocate

One speaker raises an objection so the other can demolish it. The listener learns through conflict.

Speaker 2: [questioning] But do we really need types? JavaScript was fine for years.
Speaker 1: [appalled] Fine? [exhales sharply] You debug undefined at three AM and tell me it was fine.

Trailing Thoughts

Use ellipses to indicate hesitation, trailing off, or unfinished ideas.

Speaker 1: [indecisive] So I was thinking maybe we could... uhhh...
Speaker 2: [quizzically] Use a different framework?
Speaker 1: [relieved] Yes! Thank you.

Overlapping Intent

Simulate people talking over each other without the problems that real overlapping audio causes.

Speaker 1: [starting to speak] If we refactor the—
Speaker 2: [overlapping] —the whole module?
Speaker 1: [pause] I was going to say "the config." But sure, let's burn it all down.
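
The mechanics above all rely on the same line shape: a speaker label, optional square-bracket tags, and text that may break off on a dash. A minimal sketch of a parser for that shape follows, assuming every turn is written as "Speaker N: ..." as in the examples. The names Turn, parse_turn, and is_cut_in are hypothetical helpers for illustration, not part of any ElevenLabs tooling.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed line shape, taken from the examples above: "Speaker N: text with [tags]".
TURN_RE = re.compile(r"^(Speaker \d+):\s*(.*)$")
TAG_RE = re.compile(r"\[([^\[\]]+)\]")
DASH = "\u2014"  # the long dash the examples use to mark a cut-in


class Turn(NamedTuple):
    speaker: str
    tags: List[str]
    text: str  # everything after the speaker label, tags included


def parse_turn(line: str) -> Optional[Turn]:
    """Split one script line into speaker, tags, and text; None if it is not a turn."""
    match = TURN_RE.match(line.strip())
    if match is None:
        return None
    speaker, text = match.groups()
    return Turn(speaker, TAG_RE.findall(text), text)


def is_cut_in(previous: Turn, current: Turn) -> bool:
    """True when `previous` breaks off on a dash and another speaker picks up with one."""
    picked_up = TAG_RE.sub("", current.text).lstrip()
    return (
        previous.speaker != current.speaker
        and previous.text.rstrip().endswith(DASH)
        and picked_up.startswith(DASH)
    )
```

Running is_cut_in over consecutive turns is one way to confirm that every line ending on a dash is actually answered by a different speaker.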

4. Script Structure Templates

A. Interview Format (5-15 min)

Intro (0:00-0:30): Host introduces guest + topic. One sentence each.
Background (0:30-2:00): Guest's origin story. Host asks follow-ups.
Core Topic (2:00-8:00): 3-5 questions. Each answer provokes a reaction.
Hot Take (8:00-10:00): Devil's advocate question. Lively debate.
Rapid Fire (10:00-12:00): Short questions, shorter answers.
Wrap (12:00-15:00): One lesson + where to find guest.

B. Podcast Episode (20-40 min)

Cold Open (0:00-1:00): Best clip from the episode. Hooks the listener.
Intro (1:00-2:00): Music + show identity. Host energy sets the tone.
Segment 1: News / Topic (2:00-10:00): Hosts trade takes. No monologues.
Segment 2: Deep Dive (10:00-25:00): One concept explored through dialogue.
Segment 3: Listener Mail / Q&A (25:00-35:00): Hosts read questions aloud, then debate answers.
Outro (35:00-40:00): Summary + CTA. Specific, not generic.

C. Dramatic Scene (2-5 min)

Establishment: Setting + relationship. Audio tags set mood.
Inciting Incident: Something changes. One speaker drops a bombshell.
Rising Tension: Disagreement escalates. Shorter turns, sharper tags.
Climax: Emotional peak. [angry], [desperately], [shouting].
Resolution: One speaker concedes, or they agree to disagree.
Button: Final line that lands the scene. Often ironic or bittersweet.

D. Tutorial in Dialogue (3-8 min)

Problem Setup: Host B describes the struggle. Host A listens.
Concept Introduction: Host A explains. Host B asks "dumb" questions.
Walkthrough: Step-by-step, alternating speakers.
Gotcha Moment: Host B tries it wrong. Host A corrects with a joke.
Wrap: Both confirm understanding. Shared victory.

5. TTS Text Normalization for Dialogue

Every speaker's lines must be normalized BEFORE audio tags are applied. TTS errors in dialogue are especially jarring because they break the conversational flow.

Normalization Table

Raw Input	Spoken Form
`$42.50`	forty-two dollars and fifty cents
`£1,001.32`	one thousand and one pounds and thirty-two pence
`1234`	one thousand two hundred thirty-four
`3.14`	three point one four
`555-555-5555`	five five five, five five five, five five five five
`2nd`	second
`XIV`	fourteen (or "the fourteenth" if a title)
`Dr.`	Doctor
`Ave.`	Avenue
`St.`	Street (but leave saints' names such as "St. Patrick" unchanged)
`Ctrl + Z`	control z
`100km`	one hundred kilometers
`100%`	one hundred percent
`elevenlabs.io/docs`	eleven labs dot io slash docs
`2024-01-01`	January first, two-thousand twenty-four
`14:30`	two thirty PM
`API`	A-P-I or "application programming interface"
`HTML`	H-T-M-L
`npm`	N-P-M
`JSON`	J-son or "Jay-sawn"

Dialogue-Specific Normalization

Code snippets: Read as spoken descriptions.
- Bad: const x = useState(0)
- Good: "const x equals use state zero"
File paths: Spell separators. src/components/Button.tsx → "src slash components slash button dot tee-ess-ex"
URLs in conversation: If hosts joke about a URL, normalize it fully. Nothing kills a punchline like a robot saying "h-t-t-p-s colon slash slash."
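
A few rows of the normalization table can be automated before tags are added. The sketch below is a partial illustration only: it handles small integers, percentages, and three of the acronyms listed above, and it deliberately leaves currency, dates, times, phone numbers, and unit suffixes untouched so they can be handled by hand. The function names and the 0 to 9999 range are assumptions made for this example.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ACRONYMS = {"API": "A-P-I", "HTML": "H-T-M-L", "npm": "N-P-M"}

# Integers of up to four digits that are not part of a decimal, date, time,
# phone number, or unit such as "100km" (those need their own rules).
INT_RE = re.compile(r"(?<![\d.,:/-])\b(\d{1,4})\b(?![.,:/-]?\d)")
PERCENT_RE = re.compile(r"(?<![\d.,:/-])\b(\d{1,4})%")


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 9999 (the range this sketch covers)."""
    if not 0 <= n <= 9999:
        raise ValueError("sketch only covers 0..9999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + (" " + number_to_words(rest) if rest else "")


def normalize_line(text: str) -> str:
    """Apply the percent, integer, and acronym rules to one untagged line."""
    # Percentages first, so the digits are consumed together with the sign.
    text = PERCENT_RE.sub(lambda m: number_to_words(int(m.group(1))) + " percent", text)
    text = INT_RE.sub(lambda m: number_to_words(int(m.group(1))), text)
    for raw, spoken in ACRONYMS.items():
        text = re.sub(r"\b" + re.escape(raw) + r"\b", spoken, text)
    return text
```

Anything the sketch leaves unchanged, such as "$42.50" or "2024-01-01", still has to be rewritten to its spoken form before the line is tagged.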

6. Audio Tags for Expressive Dialogue

Eleven v3 Text to Dialogue uses square-bracket tags for emotion, delivery, non-speech audio events, and scene context. Tags must describe auditory actions only.

Tag Categories

Emotional Directions: [happy], [sad], [excited], [angry], [whisper], [annoyed], [appalled], [thoughtful], [surprised], [sarcastic], [curious], [mischievously], [professional], [reassuring], [frustrated], [delighted], [deadpan], [cautiously], [cheerfully], [quizzically], [elated], [nervously], [alarmed], [sheepishly], [stifling laughter], [cracking up], [desperately], [panicking], [warmly], [impressed], [dismissive], [dramatically], [with genuine belly laugh], [robotic voice], [singing]

Non-Verbal Sounds: [laughs], [laughs harder], [starts laughing], [chuckles], [giggles], [giggling], [groaning], [sighs], [exhales], [exhales sharply], [inhales deeply], [clears throat], [short pause], [long pause], [wheezing], [snorts], [gasps], [muttering], [happy gasp]

Audio Events / Environment: [leaves rustling], [gentle footsteps], [applause], [clapping], [gunshot], [explosion], [swallows], [gulps], [record scratch], [binary beeping]

Overall Direction (scene context): [football], [wrestling match], [auctioneer], [news broadcast], [podcast studio], [hacker den], [classroom], [coffee shop]

Tag Placement Rules

Before the line for mood: [sarcastic] Oh, you thought this was easy?
After the line for reaction: Another framework. [sighs]
Inline for mid-sentence shifts: It was working [excited] until it wasn't.

Interrupt tags for overlapping intent:

Speaker 1: [starting to speak] So I was thinking—
Speaker 2: [jumping in] —that we should rewrite everything?

Scene context tags at the start of a chunk: [podcast studio] or [coffee shop]
Do NOT use non-auditory tags: [standing], [grinning], [pacing], [music]
Do NOT replace narrative text with tags. If the text says "He laughed," keep the words and add a tag alongside them: He laughed [chuckles].
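
The placement rules above can be checked mechanically. The following sketch classifies each tag in a line as before, inline, or after, and flags the non-auditory examples named in the rules. The deny list contains only those four examples and is an assumption to extend; the function names are hypothetical.

```python
import re
from typing import List, Tuple

TAG_RE = re.compile(r"\[([^\[\]]+)\]")

# Only the non-auditory examples named in the rules above; extend as needed.
NON_AUDITORY = {"standing", "grinning", "pacing", "music"}


def find_non_auditory_tags(text: str) -> List[str]:
    """Return the tags in `text` that appear on the non-auditory deny list."""
    return [tag for tag in TAG_RE.findall(text) if tag.strip().lower() in NON_AUDITORY]


def tag_positions(text: str) -> List[Tuple[str, str]]:
    """Label each tag 'before', 'inline', or 'after' relative to the spoken words."""
    positions = []
    for match in TAG_RE.finditer(text):
        # Ignore other tags when deciding whether spoken words surround this one.
        words_before = TAG_RE.sub("", text[:match.start()]).strip()
        words_after = TAG_RE.sub("", text[match.end():]).strip()
        if not words_before:
            where = "before"
        elif not words_after:
            where = "after"
        else:
            where = "inline"
        positions.append((match.group(1), where))
    return positions
```

The input is expected to be the spoken part of a line, without the "Speaker N:" label.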

Tag Density Guidelines

Content Type	Tags per Speaker	Guidance
Interview	4-8	Professional but expressive
Podcast	6-12	Higher energy, more reactions
Dramatic Scene	10-20	Full emotional arc per speaker
Tutorial Dialogue	3-6	Clear and helpful, occasional humor
Customer Simulation	4-6	`[professional]`, `[sympathetic]`, `[reassuring]`
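
To compare a draft against the density table, tags can be counted per speaker. This is a rough sketch that assumes turns are written as "Speaker N: ..." and that the ranges apply to whatever span of script is passed in (the table does not say whether the unit is a chunk or the whole script). The names tags_per_speaker and density_report are hypothetical.

```python
import re
from collections import Counter
from typing import Dict

TURN_RE = re.compile(r"^(Speaker \d+):\s*(.*)$")
TAG_RE = re.compile(r"\[([^\[\]]+)\]")

# (minimum, maximum) tags per speaker, copied from the table above.
DENSITY = {
    "Interview": (4, 8),
    "Podcast": (6, 12),
    "Dramatic Scene": (10, 20),
    "Tutorial Dialogue": (3, 6),
    "Customer Simulation": (4, 6),
}


def tags_per_speaker(script: str) -> Counter:
    """Count square-bracket tags for each speaker label found in the script."""
    counts: Counter = Counter()
    for line in script.splitlines():
        match = TURN_RE.match(line.strip())
        if match:
            speaker, text = match.groups()
            counts[speaker] += len(TAG_RE.findall(text))
    return counts


def density_report(script: str, content_type: str) -> Dict[str, str]:
    """Mark each speaker as 'too few', 'ok', or 'too many' for the content type."""
    low, high = DENSITY[content_type]
    report = {}
    for speaker, count in tags_per_speaker(script).items():
        if count < low:
            report[speaker] = "too few"
        elif count > high:
            report[speaker] = "too many"
        else:
            report[speaker] = "ok"
    return report
```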

7. Pacing & Pause Control (v3 Specific)

Eleven v3 does NOT support <break> tags. Use these instead:

Technique	Effect	Example
Ellipses	Hesitation, trailing thought	`[indecisive] I was thinking maybe... uhhh...`
Capitalization	Emphasis, stress	`It was a VERY long day.`
Dashes	Interruption, abrupt stop	`Wait — you're telling me it works?`
Commas	Brief rhythmic pause	`Okay, so, the way it works is...`
Periods	Hard stop, new beat	Short sentences. Clear ideas.
Audio tags	Breath, pause, non-verbal	`[short pause]`, `[exhales]`, `[sighs]`

Pacing Rules for Dialogue

Average turn length: 3-8 seconds. Never let one speaker hold the floor for more than 15 seconds.
Reaction beats: The other speaker should respond within 1-2 seconds of a punchline or key point.
Vary turn length: Mix short reactions ("Wait, WHAT?") with longer explanations to create rhythm.
Silence as a beat: Dead silence after an absurd statement can be comedic — but mark it with [long pause] or a scene tag.
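
Turn length can be estimated from word count before any audio is generated. The sketch below assumes a speaking rate of about 150 words per minute (2.5 words per second), which is a rough figure chosen for illustration and not a measured property of Eleven v3. Tags are removed before counting because they are not spoken.

```python
import re
from typing import List

TAG_RE = re.compile(r"\[([^\[\]]+)\]")

# Assumption: roughly 150 words per minute. Tune this against real output.
WORDS_PER_SECOND = 2.5


def estimate_seconds(text: str, words_per_second: float = WORDS_PER_SECOND) -> float:
    """Estimate how long one turn takes to speak, ignoring audio tags."""
    spoken = TAG_RE.sub(" ", text)
    return len(spoken.split()) / words_per_second


def long_turns(turns: List[str], limit_seconds: float = 15.0) -> List[int]:
    """Return the indexes of turns whose estimate exceeds the limit."""
    return [i for i, turn in enumerate(turns) if estimate_seconds(turn) > limit_seconds]
```

Any index returned by long_turns marks a turn that probably needs a reaction, question, or interruption inserted.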

8. API Constraints & Chunking

ElevenLabs Text to Dialogue has specific limits:

Constraint	Limit
Model	Eleven v3 ONLY
Characters per request	Max 2,000 total across ALL speakers
Speakers	Unlimited, but 2-3 recommended for clarity
Determinism	Nondeterministic — use `seed` parameter for consistency
Free regenerations	2 per generation (dashboard only, same params)
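
As an illustration of how these constraints might be enforced before a request is sent, the sketch below turns a script chunk into a request body. The field names inputs, text, voice_id, model_id, and seed, and the model identifier "eleven_v3", are assumptions about the Text to Dialogue request format and should be checked against the current API reference. The voice IDs are placeholders, and no network call is made.

```python
import re
from typing import Dict, Optional

TURN_RE = re.compile(r"^(Speaker \d+):\s*(.*)$")
MAX_CHARS = 2000  # per-request limit stated in the table above


def build_dialogue_payload(script: str, voices: Dict[str, str],
                           seed: Optional[int] = None) -> dict:
    """Build a request body from "Speaker N: ..." lines (assumed field names)."""
    if len(set(voices.values())) < len(voices):
        raise ValueError("each speaker needs a distinct voice")
    inputs = []
    for line in script.splitlines():
        match = TURN_RE.match(line.strip())
        if match is None:
            continue  # skip blank lines and anything that is not a turn
        speaker, text = match.groups()
        inputs.append({"text": text, "voice_id": voices[speaker]})
    total = sum(len(item["text"]) for item in inputs)
    if total > MAX_CHARS:
        raise ValueError(f"chunk is {total} characters; the limit is {MAX_CHARS}")
    payload = {"model_id": "eleven_v3", "inputs": inputs}
    if seed is not None:
        payload["seed"] = seed  # reuse the same seed when regenerating a chunk
    return payload
```

The character count here covers only the text of each turn, tags included, and not the speaker labels. Whether the service counts characters the same way is also an assumption.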

Chunking Strategy

If the full script exceeds 2,000 characters:

Split at natural scene boundaries, topic shifts, or hard cuts.
End each chunk on a mini-cliffhanger, reaction beat, or emotional peak.
Use the same seed value if regenerating a chunk for consistency.
Concatenate audio in post-production with hard cuts.

Quick reference: A 5-minute interview at conversational pace is usually 4-6 chunks.
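
The splitting step can be sketched as a greedy packer that never breaks a turn in half. It only enforces the character limit: choosing scene boundaries and cliffhanger endings remains an editorial decision, so treat its output as a first pass. The function names are hypothetical, and the count uses the length of each string exactly as passed in.

```python
from typing import List


def chunk_turns(turns: List[str], max_chars: int = 2000) -> List[List[str]]:
    """Greedily pack whole turns into chunks that stay within max_chars."""
    chunks: List[List[str]] = []
    current: List[str] = []
    size = 0
    for turn in turns:
        if len(turn) > max_chars:
            raise ValueError("a single turn exceeds the limit; split it by hand")
        # Start a new chunk when adding this turn would cross the limit.
        if current and size + len(turn) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(turn)
        size += len(turn)
    if current:
        chunks.append(current)
    return chunks


def chunk_header(index: int, chunk: List[str], max_chars: int = 2000) -> str:
    """Render a chunk heading that reports the characters used."""
    used = sum(len(turn) for turn in chunk)
    return f"### Chunk {index} (chars: {used} / {max_chars})"
```

After packing, move the boundaries by hand so that each chunk ends on a reaction beat or emotional peak, then re-check the counts.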

9. Output Format

Deliver the final script in this structure:

# [Title] — Dialogue Script

## Metadata
- **Content Type**: [Interview / Podcast / Dramatic Scene / Tutorial Dialogue]
- **Target Duration**: [e.g., 5-10 minutes]
- **Model**: Eleven v3 (Text to Dialogue)
- **Speakers**: [count]
- **Chunks**: [1-N]
- **Scene Context**: [podcast studio / coffee shop / classroom / etc.]

## Speaker Definitions
- **Speaker 1 (Alex)**: [Role, tone, tag signature]
- **Speaker 2 (Jordan)**: [Role, tone, tag signature]
- **Speaker 3 (optional)**: [Role, tone, tag signature]

## Dialogue Script

### Chunk 1 (chars: X / 2000)
[Scene context tag if applicable]

Speaker 1: [deadpan] The deployment failed.
Speaker 2: [alarmed] Again?!
Speaker 1: [sarcastic] Third time this week.
...

### Chunk 2 (chars: X / 2000)

Speaker 2: [starting to speak] So what you're saying is—
Speaker 1: [jumping in] —that we should have used Kubernetes?
...

## Pronunciation Notes
- "React": ree-act
- "Kubernetes": koo-ber-net-ees
- [Any ambiguous terms]

## Post-Production Notes
- Compress all voices equally (2:1 ratio, -18dB threshold)
- EQ: slight high-mid boost (+2-3dB at 3-5kHz) for clarity
- Hard cuts between chunks — no crossfade
- Optional: background ambience per scene context

10. Anti-Patterns (NEVER Do)

One-speaker monologues — If a speaker talks for more than 15 seconds, insert a reaction, question, or interruption.
"So, welcome to the show" intros — Start with conflict, curiosity, or a provocative statement.
Leaving raw symbols — $100, API, npm, 2024-01-01 must be normalized.
No audio tags — Flat dialogue sounds like robots reading a transcript.
Non-auditory tags — [grinning], [pacing], [music] will be spoken or fail.
Consensus-only dialogue — Agreement is boring. Let speakers disagree, misunderstand, or one-up each other.
Ignoring the 2000-char limit — The API will reject or truncate. Chunk proactively.
Identical voices — If using the API, assign distinct voices to each speaker. Same voice = confusing dialogue.
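Fade-transition language — Do not write lines that imply fades or gradual transitions. Chunks are joined with hard cuts, so the script should read like hard cuts between takes.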
Using <break> tags — v3 does not support SSML break tags. Use ellipses and audio tags only.

11. Quality Checklist

Before delivering the script, verify:

Repository

jarmen423/skills

First Seen

Apr 23, 2026
