Generate natural-sounding speech from text using Cartesia Sonic 3.

Requirements

curl, python3, ffprobe (optional, for reporting audio duration), ffmpeg (optional, for joining multiple parts)

Setup

Save your Cartesia API key:

mkdir -p ~/.cartesia && echo "YOUR_KEY" > ~/.cartesia/credentials && chmod 600 ~/.cartesia/credentials

Alternatively, set the CARTESIA_API_KEY environment variable or pass the --api-key flag.

Resolution order: --api-key flag > $CARTESIA_API_KEY env var > ~/.cartesia/credentials file.
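The lookup order above can be sketched in Python as follows. This is a minimal illustration, not the code inside scripts/tts.sh; the function name `resolve_api_key` and its parameters are assumptions made for the example.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_api_key(
    flag_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    credentials_path: Optional[Path] = None,
) -> Optional[str]:
    """Return the first key found: flag, then env var, then credentials file."""
    env = os.environ if env is None else env
    path = credentials_path or Path.home() / ".cartesia" / "credentials"

    # 1. An explicit --api-key value wins.
    if flag_value and flag_value.strip():
        return flag_value.strip()

    # 2. Fall back to the CARTESIA_API_KEY environment variable.
    env_key = env.get("CARTESIA_API_KEY", "").strip()
    if env_key:
        return env_key

    # 3. Finally, read ~/.cartesia/credentials if it exists.
    if path.is_file():
        file_key = path.read_text(encoding="utf-8").strip()
        return file_key or None

    return None
```

Passing `env` and `credentials_path` explicitly keeps the function easy to test without touching real credentials.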

API details

API: Cartesia (https://api.cartesia.ai)
Model: sonic-3
API version header: Cartesia-Version: 2026-03-01 (required on every request)
Auth header: X-API-Key: {key} (NOT Bearer token)

Default voice

Ronald: 5ee9feff-1265-424a-9d7f-8e4d431a12c7

Natural, conversational male voice. Good for narration, briefings, and long-form content.

How to use this skill — the full workflow

TTS is a two-step process: write a script, then generate audio. Never skip the first step. Raw source content (blog posts, docs, notes, case studies) sounds terrible when fed directly to TTS — you can hear the "AI blog post" cadence and it's painful to listen to.

Step 1: Write a spoken script (.txt file)

Before generating audio, ALWAYS rewrite the source material into a conversational script. This is the most important step. Save it as a plain .txt file with no formatting.

The goal: it should sound like a knowledgeable person casually explaining this stuff to you — like a friend briefing you over coffee, not a corporate blog post being read aloud.

Rules for the script:

No headings, no bullet points, no markdown, no structure — just flowing paragraphs of natural speech
Write in first person or second person. "Let me walk you through this" not "This document outlines"
Use contractions: "they're", "it's", "don't", "that's". Nobody speaks without contractions
Spell out numbers: "ninety-five percent" not "95%", "two billion dollars" not "$2B", "seventy-two hours" not "72 hours"
Spell out abbreviations on first use: "Forward Deployed Engineers, or FDEs" not just "FDEs"
Use natural transitions between topics: "Now, the interesting part is...", "Which brings us to...", "So here's where it gets good..."
Vary sentence length. Short punchy sentences mixed with longer flowing ones. Monotonous rhythm is death
Add conversational filler where it feels natural: "honestly", "basically", "the crazy thing is", "and look"
When citing quotes, introduce them naturally: "Their CTO said they evaluated over a dozen vendors" — don't use quotation marks or attribution blocks
When citing stats, weave them into sentences: "They hit seventy-five percent resolution on the first attempt" not "Resolution rate: 75%"
No sign-off or summary paragraph at the end. Just stop when the content is done. Don't wrap up with "So that's..." or "In conclusion..."
Let the content dictate the length. Don't pad and don't artificially truncate. A 3-minute briefing is fine. So is 15 minutes
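Some of these rules are mechanical enough to check before generating audio. The sketch below flags leftover digits, percent or dollar symbols, and markdown headings or bullets. The `lint_script` helper and its patterns are hypothetical, and they only catch surface problems; tone and rhythm still need a human read.

```python
import re
from typing import List, Tuple

# Hypothetical checks derived from the rules above; extend as needed.
RULE_CHECKS = [
    ("digit: spell the number out", re.compile(r"\d")),
    ("symbol: write 'percent' or 'dollars'", re.compile(r"[%$]")),
    ("markdown heading or bullet", re.compile(r"^\s*(#{1,6}\s|[-*+]\s|\d+\.\s)")),
]


def lint_script(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for each rule a line breaks."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in RULE_CHECKS:
            if pattern.search(line):
                findings.append((line_number, label))
    return findings
```

An empty result means the script passed these three checks, not that it is ready to read aloud.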

What to avoid:

Don't preserve the original document's structure — restructure for spoken flow
Don't read section headers aloud ("Chapter one, the execution gap...")
Don't read out URLs, citation marks, or formatting artifacts
Don't use the word "delve"
Don't be breathlessly enthusiastic. Be measured, thoughtful, and occasionally wry
Don't jump registers without a bridge. This covers several related smells:
- Fact-to-figurative whiplash: "The S and P jumped one percent. Everyone exhaled." — a dry stat followed by a poetic beat with no hinge. Fix: "The S and P jumped one percent, and you could feel the relief."
- Unprepared personification: "Markets loved it." drops in a human emotion for an abstract noun with no setup. Fix: go straight to the data — "The reaction was immediate" — or bridge it naturally.
- Dangling comma clause: "Brent crude dropped more than ten percent, fell below a hundred dollars a barrel." The comma creates a micro-pause that makes the second fact sound like a half-attached afterthought. In writing your eye glides over it; in speech it dangles. Fix: join with "and" ("dropped more than ten percent and fell below...") or make two full sentences.
Don't start with "Hey there!" or any greeting. Just start talking about the subject
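A few of the items in this list can be caught with simple pattern matching. The following is a rough sketch; `find_avoided_patterns` and its word lists are assumptions for illustration. Register jumps and dangling clauses are not something a regex can judge.

```python
import re
from typing import List


def find_avoided_patterns(script: str) -> List[str]:
    """Return a description of each mechanical 'avoid' rule the script breaks."""
    problems = []
    text = script.strip()

    # The word "delve" in any common inflection.
    if re.search(r"\bdelv(e|es|ed|ing)\b", text, re.IGNORECASE):
        problems.append("uses 'delve'")

    # URLs should never be read out.
    if re.search(r"https?://|www\.", text, re.IGNORECASE):
        problems.append("contains a URL")

    # The script should open on the subject, not a greeting.
    if re.match(r"(hey|hi|hello|welcome)\b", text, re.IGNORECASE):
        problems.append("opens with a greeting")

    return problems
```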

Step 2: Generate the audio

./scripts/tts.sh <script.txt> <output.mp3>

Or call the API directly:

curl -sS --fail -X POST https://api.cartesia.ai/tts/bytes \
  -H "X-API-Key: $(cat ~/.cartesia/credentials)" \
  -H "Cartesia-Version: 2026-03-01" \
  -H "Content-Type: application/json" \
  -d "$(python3 -c "
import json
with open('script.txt') as f:
    text = f.read()
print(json.dumps({
    'model_id': 'sonic-3',
    'transcript': text,
    'voice': {'mode': 'id', 'id': '5ee9feff-1265-424a-9d7f-8e4d431a12c7'},
    'output_format': {'container': 'mp3', 'bit_rate': 128000, 'sample_rate': 44100, 'encoding': 'mp3'}
}))
")" \
  -o output.mp3

The response body IS the MP3 file. No JSON wrapper. Just pipe or save directly.
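For callers who would rather stay in Python, the same request can be sketched with the standard library. The payload and headers mirror the curl call above; the function names `build_tts_request` and `synthesize_to_file` are assumptions for this example and are not part of the skill's scripts.

```python
import json
import urllib.request

API_URL = "https://api.cartesia.ai/tts/bytes"
API_VERSION = "2026-03-01"
DEFAULT_VOICE_ID = "5ee9feff-1265-424a-9d7f-8e4d431a12c7"  # Ronald


def build_tts_request(transcript, api_key, voice_id=DEFAULT_VOICE_ID):
    """Build the POST request without sending it."""
    payload = {
        "model_id": "sonic-3",
        "transcript": transcript,
        "voice": {"mode": "id", "id": voice_id},
        "output_format": {
            "container": "mp3",
            "bit_rate": 128000,
            "sample_rate": 44100,
            "encoding": "mp3",
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-API-Key": api_key,  # not a Bearer token
            "Cartesia-Version": API_VERSION,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(transcript, api_key, output_path, voice_id=DEFAULT_VOICE_ID):
    """Send the request and write the raw MP3 bytes to output_path."""
    request = build_tts_request(transcript, api_key, voice_id)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so an
    # error response is never written to disk as if it were audio.
    with urllib.request.urlopen(request) as response, open(output_path, "wb") as out:
        out.write(response.read())
```

Splitting request construction from the network call makes it possible to inspect the payload before spending API credits.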

Available voices

List voices:

curl -s -H "X-API-Key: $(cat ~/.cartesia/credentials)" \
  -H "Cartesia-Version: 2026-03-01" \
  https://api.cartesia.ai/voices

To use a different voice, pass --voice <id> to the script. You can also use a custom/cloned voice ID directly.

Multiple parts

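For longer scripts, split at paragraph breaks and generate each part separately, using the same voice and output format for every part. Then concatenate with ffmpeg, which joins the files without re-encoding: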

ffmpeg -i "concat:part1.mp3|part2.mp3|part3.mp3" -c copy output.mp3
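The splitting step can be sketched as below. The `max_chars` threshold is a hypothetical value chosen for illustration, since no specific length limit is stated here, and both helper names are assumptions. Paragraphs are never cut in the middle, so a single paragraph longer than the threshold becomes its own part.

```python
import re
from typing import List


def split_script(text: str, max_chars: int = 2000) -> List[str]:
    """Group whole paragraphs into parts of at most max_chars where possible."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    parts: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current part.
            parts.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts


def concat_command(part_files: List[str], output_file: str) -> List[str]:
    """Build the ffmpeg concat command shown above as an argument list."""
    return ["ffmpeg", "-i", "concat:" + "|".join(part_files), "-c", "copy", output_file]
```

The list returned by `concat_command` can be passed to `subprocess.run` once every part has been generated.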

Uploading

After generating audio, use the airloom skill to upload and get a shareable URL:

~/.agents/skills/airloom/scripts/upload.sh output.mp3 --title "My Audio" --client claude-code