tts
When to Use
- User wants to convert text to spoken audio
- User asks for "read aloud", "TTS", "text to speech", "voice narration"
- User says "朗读", "配音", "语音合成"
- User wants multi-speaker scripted audio or dialogue
When NOT to Use
- User wants a podcast-style discussion with topic exploration (use
/podcast) - User wants an explainer video with visuals (use
/explainer) - User wants to generate an image (use
/image-gen)
Purpose
Convert text into natural-sounding speech audio. Two paths:
- Quick mode (
/v1/tts): Single voice, low-latency, sync MP3 stream. For casual chat, reading snippets, instant audio. - Script mode (
/v1/speech): Multi-speaker, per-segment voice assignment. For dialogue, audiobooks, scripted content.
Hard Constraints
- No shell scripts. Construct curl commands from the API reference files listed in Resources
- Always read
shared/authentication.mdfor API key and headers - Follow
shared/common-patterns.mdfor errors and interaction patterns - Never hardcode speaker IDs — always fetch from the speakers API
- Always read config following
shared/config-pattern.mdbefore any interaction - Always follow
shared/speaker-selection.mdfor speaker selection (text table + free-text input) - Never save files to
~/Downloads/or/tmp/as primary output — use.listenhub/tts/
Mode Detection
Determine the mode from the user's input automatically before asking any questions:
| Signal | Mode |
|---|---|
| "多角色", "脚本", "对话", "script", "dialogue", "multi-speaker" | Script |
| Multiple characters mentioned by name or role | Script |
| Input contains structured segments (A: ..., B: ...) | Script |
| Single paragraph of text, no character markers | Quick |
| "读一下", "read this", "TTS", "朗读" with plain text | Quick |
| Ambiguous | Quick (default) |
Interaction Flow
Step -1: API Key Check
Follow shared/config-pattern.md § API Key Check. If the key is missing, stop immediately.
Step 0: Config Setup
Follow shared/config-pattern.md Step 0.
If file doesn't exist — ask location, then create immediately:
mkdir -p ".listenhub/tts"
echo '{"outputDir":".listenhub","outputMode":"inline","language":null,"defaultSpeakers":{}}' > ".listenhub/tts/config.json"
CONFIG_PATH=".listenhub/tts/config.json"
# (or $HOME/.listenhub/tts/config.json for global)
Then run Setup Flow below.
If file exists — read config, display summary, and confirm:
当前配置 (tts):
输出方式:{inline / download / both}
语言偏好:{zh / en / 未设置}
默认主播:{speakerName / 未设置}
Ask: "使用已保存的配置?" → 确认,直接继续 / 重新配置
Setup Flow (first run or reconfigure)
Ask these questions in order, then save all answers to config at once:
-
outputMode: Follow
shared/output-mode.md§ Setup Flow Question. -
Language (optional): "默认语言?"
- "中文 (zh)"
- "English (en)"
- "每次手动选择" → keep
null
After collecting answers, save immediately:
# Save outputMode; only update language if user picked one
# Follow shared/output-mode.md § Save to Config
NEW_CONFIG=$(echo "$CONFIG" | jq --arg m "$OUTPUT_MODE" '. + {"outputMode": $m}')
# If language was chosen (not "每次手动选择"):
NEW_CONFIG=$(echo "$NEW_CONFIG" | jq --arg lang "zh" '. + {"language": $lang}')
echo "$NEW_CONFIG" > "$CONFIG_PATH"
CONFIG=$(cat "$CONFIG_PATH")
Note: defaultSpeakers are saved after speaker selection in Step 3 — not here.
Quick Mode — POST /v1/tts
Step 1: Extract text
Get the text to convert. If the user hasn't provided it, ask:
"What text would you like me to read aloud?"
Step 2: Determine voice
- If
config.defaultSpeakers.{language}[0]is set → use it silently (skip to Step 4) - Otherwise:
GET /speakers/list?language={detected-language}, then followshared/speaker-selection.md(text table + free-text input)
Step 3: Save preference
Question: "Save {voice name} as your default voice for {language}?"
Options:
- "Yes" — update .listenhub/tts/config.json
- "No" — use for this session only
Step 4: Confirm
Ready to generate:
Text: "{first 80 chars}..."
Voice: {voice name}
Proceed?
Step 5: Generate
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": "...", "voice": "..."}' \
--output /tmp/tts-output.mp3
Step 6: Present result
Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.
Use a timestamped jobId: $(date +%s)
inline or both (TTS quick returns a sync audio stream — no audioUrl):
JOB_ID=$(date +%s)
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": "...", "voice": "..."}' \
--output /tmp/tts-${JOB_ID}.mp3
Then use the Read tool on /tmp/tts-{jobId}.mp3.
Present:
Audio generated!
download or both:
JOB_ID=$(date +%s)
DATE=$(date +%Y-%m-%d)
JOB_DIR=".listenhub/tts/${DATE}-${JOB_ID}"
mkdir -p "$JOB_DIR"
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": "...", "voice": "..."}' \
--output "${JOB_DIR}/${JOB_ID}.mp3"
Present:
Audio generated!
已下载到 .listenhub/tts/{YYYY-MM-DD}-{jobId}/:
{jobId}.mp3
Script Mode — POST /v1/speech
Step 1: Get scripts
Determine whether the user already has a scripts array:
-
Already provided (JSON or clear segments): parse and display for confirmation
-
Not yet provided: help the user structure segments. Ask:
"Please provide the script with speaker assignments. Format: each line as
SpeakerName: text content. I'll convert it."Once the user provides the script, parse it into the
scriptsJSON format.
Step 2: Assign voices per character
For each unique character in the script:
- If
config.defaultSpeakers.{language}has saved voices → auto-assign silently (one per character in order) - Otherwise: fetch
GET /speakers/list?language={detected-language}and followshared/speaker-selection.mdfor each character
Step 3: Save preferences
After all voices are assigned (if any were new):
Question: "Save these voice assignments for future sessions?"
Options:
- "Yes" — update defaultSpeakers in .listenhub/tts/config.json
- "No" — use for this session only
Step 4: Confirm
Ready to generate:
Characters:
{name}: {voice}
{name}: {voice}
Segments: {count}
Title: (auto-generated)
Proceed?
Step 5: Generate
Write the request body to a temp file, then submit:
# Write request to temp file
cat > /tmp/lh-speech-request.json << 'ENDJSON'
{
"scripts": [
{"content": "...", "speakerId": "..."},
{"content": "...", "speakerId": "..."}
]
}
ENDJSON
# Submit
curl -sS -X POST "https://api.marswave.ai/openapi/v1/speech" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-d @/tmp/lh-speech-request.json
rm /tmp/lh-speech-request.json
Step 6: Present result
Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.
inline or both: Display the audioUrl and subtitlesUrl as clickable links.
Present:
Audio generated!
在线收听:{audioUrl}
字幕:{subtitlesUrl}
时长:{audioDuration / 1000}s
消耗积分:{credits}
download or both: Also download the file.
DATE=$(date +%Y-%m-%d)
JOB_DIR=".listenhub/tts/${DATE}-{jobId}"
mkdir -p "$JOB_DIR"
curl -sS -o "${JOB_DIR}/{jobId}.mp3" "{audioUrl}"
Present the download path in addition to the above summary.
Updating Config
When saving preferences, merge into .listenhub/tts/config.json — do not overwrite unchanged keys.
Follow the merge pattern in shared/config-pattern.md.
- Quick voice: set
defaultSpeakers.{language}[0]to the selectedspeakerId - Script voices: set
defaultSpeakers.{language}to the full array assigned this session - Language: set
languageif the user explicitly specifies it
API Reference
- TTS & Speech endpoints:
shared/api-tts.md - Speaker list:
shared/api-speakers.md - Speaker selection guide:
shared/speaker-selection.md - Error handling:
shared/common-patterns.md§ Error Handling - Long text input:
shared/common-patterns.md§ Long Text Input
Composability
- Invokes: speakers API (for speaker selection)
- Invoked by: explainer (for voiceover)
Examples
Quick mode:
"TTS this: The server will be down for maintenance at midnight."
- Detect: Quick mode (plain text, "TTS this")
- Read config:
quickVoiceisnull - Fetch speakers, user picks "Yuanye"
- Ask to save → yes → update config
POST /v1/ttswithinput+voice- Present:
/tmp/tts-output.mp3
Script mode:
"帮我做一段双人对话配音,A说:欢迎大家,B说:谢谢邀请"
- Detect: Script mode ("双人对话")
- Parse segments: A → "欢迎大家", B → "谢谢邀请"
- Read config:
scriptVoicesempty - Fetch
zhspeakers, assign A and B voices - Ask to save → yes → update config
POST /v1/speechwith scripts array- Present:
audioUrl,subtitlesUrl, duration