# Fish Audio Raw API Skill (`fish-audio-api`)

Use this skill to generate correct, runnable Fish Audio API calls without any SDK. The canonical machine-readable sources are:

- REST: https://docs.fish.audio/api-reference/openapi.json
- WebSocket: https://docs.fish.audio/api-reference/asyncapi.yml

This file condenses those into rules an agent can apply directly.
## Global facts

- Base URL: `https://api.fish.audio`
- WebSocket base: `wss://api.fish.audio`
- Auth (all endpoints): `Authorization: Bearer <FISH_API_KEY>`
- Get API keys: https://fish.audio/app/api-keys
- Never hardcode keys; read them from an env var like `FISH_API_KEY`.
- Errors are JSON `{status, message}` for 401 / 402 / 404, and an array of `{loc, type, msg, ctx, in}` for 422 (validation).
## Endpoint map

| Method | Path | Purpose |
|---|---|---|
| POST | /v1/tts | Text-to-Speech (streams audio bytes) |
| POST | /v1/asr | Speech-to-Text (returns JSON transcript) |
| GET | /model | List voice models |
| POST | /model | Create voice model (voice cloning) |
| GET | /model/{id} | Get voice model metadata |
| PATCH | /model/{id} | Update voice model |
| DELETE | /model/{id} | Delete voice model |
| GET | /wallet/{user_id}/package | Subscription package info (user_id defaults to self) |
| GET | /wallet/{user_id}/api-credit | API credit balance (user_id defaults to self) |
| WSS | /v1/tts/live | Real-time TTS streaming (MessagePack frames) |
## Text-to-Speech — POST /v1/tts

Required headers:

- `Authorization: Bearer <FISH_API_KEY>`
- `Content-Type: application/json` or `application/msgpack`
- `model: s2-pro` (required). Values: `s1`, `s2-pro`. Default to `s2-pro` unless the user explicitly asks otherwise.

Response: streaming audio bytes (`Transfer-Encoding: chunked`) in the format set by `format`. Write to a file or pipe to a player. There is no JSON wrapper on success.
### Request body fields (TTSRequest)

| Field | Type | Default | Notes |
|---|---|---|---|
| `text` | string | — (required) | The text to synthesize. Use speaker tags `<\|speaker:0\|>`, `<\|speaker:1\|>` for multi-speaker. |
| `reference_id` | string \| string[] \| null | null | Voice model ID. Array = multi-speaker (S2-Pro only). |
| `references` | ReferenceAudio[] \| ReferenceAudio[][] \| null | null | Inline zero-shot cloning samples. Requires `application/msgpack` because audio is raw bytes. 2D array for multi-speaker. |
| `temperature` | number 0–1 | 0.7 | Expressiveness. |
| `top_p` | number 0–1 | 0.7 | Nucleus sampling. |
| `prosody.speed` | number 0.5–2 | 1 | Playback speed. |
| `prosody.volume` | number (dB) | 0 | Loudness offset. |
| `prosody.normalize_loudness` | bool | true | S2-Pro only. |
| `chunk_length` | int 100–300 | 300 | Text segment size. |
| `min_chunk_length` | int 0–100 | 50 | Min chars before a new chunk. |
| `normalize` | bool | true | Normalize numbers/etc. for EN/ZH. |
| `format` | wav \| pcm \| mp3 \| opus | mp3 | Output format. |
| `sample_rate` | int \| null | null (44100, or 48000 for opus) | Output sample rate. |
| `mp3_bitrate` | 64 \| 128 \| 192 | 128 | Only when `format=mp3`. |
| `opus_bitrate` | -1000 \| 24 \| 32 \| 48 \| 64 | -1000 (auto) | Only when `format=opus`. |
| `latency` | low \| normal \| balanced | normal | Quality vs latency. |
| `max_new_tokens` | int | 1024 | Per-chunk audio token cap. |
| `repetition_penalty` | number | 1.2 | >1.0 reduces repeats. |
| `condition_on_previous_chunks` | bool | true | Cross-chunk voice consistency. |
| `early_stop_threshold` | number 0–1 | 1.0 | Batch early-stop. |

`ReferenceAudio = { audio: <raw bytes>, text: <transcript string> }`. 10–30 s of clean speech works best.
## Voice source rules

- Library / custom voice model → set `reference_id` to the model ID. Simplest path.
- Zero-shot from audio → set `references` (array of `{audio, text}`) and use a MessagePack body. JSON cannot carry raw audio bytes.
- Multi-speaker dialogue (S2-Pro only) → `reference_id: [id0, id1, ...]` and embed `<|speaker:0|>` / `<|speaker:1|>` markers inside `text`. For zero-shot multi-speaker, `references` is an array-of-arrays, one inner array per speaker.
### Single-speaker curl

```bash
curl --request POST https://api.fish.audio/v1/tts \
  --header "Authorization: Bearer $FISH_API_KEY" \
  --header "Content-Type: application/json" \
  --header "model: s2-pro" \
  --data '{
    "text": "Hello! Welcome to Fish Audio.",
    "reference_id": "<voice-model-id>",
    "format": "mp3",
    "mp3_bitrate": 128,
    "latency": "normal"
  }' \
  --output out.mp3
```
### Multi-speaker curl (S2-Pro)

```bash
curl --request POST https://api.fish.audio/v1/tts \
  --header "Authorization: Bearer $FISH_API_KEY" \
  --header "Content-Type: application/json" \
  --header "model: s2-pro" \
  --data '{
    "text": "<|speaker:0|>Good morning!<|speaker:1|>Good morning! How are you?",
    "reference_id": ["<speaker-0-id>", "<speaker-1-id>"],
    "format": "mp3"
  }' \
  --output dialogue.mp3
```
### Python (no SDK, streaming to file)

```python
import os, httpx

payload = {
    "text": "Hello from Fish Audio.",
    "reference_id": "<voice-model-id>",
    "format": "mp3",
    "latency": "normal",
}
headers = {
    "Authorization": f"Bearer {os.environ['FISH_API_KEY']}",
    "Content-Type": "application/json",
    "model": "s2-pro",
}
with httpx.stream("POST", "https://api.fish.audio/v1/tts",
                  headers=headers, json=payload, timeout=None) as r:
    r.raise_for_status()
    with open("out.mp3", "wb") as f:
        for chunk in r.iter_bytes():
            f.write(chunk)
```
### Python with inline references (MessagePack)

```python
import os, httpx, msgpack

with open("sample.wav", "rb") as f:
    ref_audio = f.read()

payload = {
    "text": "Clone this voice and say this line.",
    "references": [{"audio": ref_audio, "text": "Transcript of sample.wav."}],
    "format": "mp3",
}
headers = {
    "Authorization": f"Bearer {os.environ['FISH_API_KEY']}",
    "Content-Type": "application/msgpack",
    "model": "s2-pro",
}
body = msgpack.packb(payload, use_bin_type=True)
with httpx.stream("POST", "https://api.fish.audio/v1/tts",
                  headers=headers, content=body, timeout=None) as r:
    r.raise_for_status()
    with open("out.mp3", "wb") as f:
        for chunk in r.iter_bytes():
            f.write(chunk)
```
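Zero-shot multi-speaker combines the two rules above: a 2D `references` array (one inner list of samples per speaker) plus `<|speaker:N|>` tags in `text`. A minimal sketch, assuming the same `/v1/tts` MessagePack flow as the example above; `build_multispeaker_payload` and `synthesize` are hypothetical helper names:

```python
import os

def build_multispeaker_payload(clips):
    """clips: list of (audio_bytes, transcript) pairs, one clip per speaker."""
    return {
        # Speaker tags in text line up with the outer index of references.
        "text": "<|speaker:0|>Hi there!<|speaker:1|>Hello back!",
        # 2D array: references[n] is the list of samples for speaker n.
        "references": [[{"audio": audio, "text": text}] for audio, text in clips],
        "format": "mp3",
    }

def synthesize(clips, out_path="dialogue.mp3"):
    import httpx, msgpack  # deferred so the builder stays dependency-free
    body = msgpack.packb(build_multispeaker_payload(clips), use_bin_type=True)
    headers = {
        "Authorization": f"Bearer {os.environ['FISH_API_KEY']}",
        "Content-Type": "application/msgpack",  # raw bytes cannot ride in JSON
        "model": "s2-pro",                      # multi-speaker is S2-Pro only
    }
    with httpx.stream("POST", "https://api.fish.audio/v1/tts",
                      headers=headers, content=body, timeout=None) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_bytes():
                f.write(chunk)
```

The payload builder is pure, so you can inspect the 2D `references` shape before packing it.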
### Node.js (fetch, streaming)

```js
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

const res = await fetch("https://api.fish.audio/v1/tts", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.FISH_API_KEY}`,
    "Content-Type": "application/json",
    model: "s2-pro",
  },
  body: JSON.stringify({
    text: "Hello from Fish Audio.",
    reference_id: "<voice-model-id>",
    format: "mp3",
    latency: "normal",
  }),
});
if (!res.ok) throw new Error(`${res.status} ${await res.text()}`);
await pipeline(Readable.fromWeb(res.body), createWriteStream("out.mp3"));
```
## Speech-to-Text — POST /v1/asr

Required headers: `Authorization`. Content type: `multipart/form-data` or `application/msgpack`.

Form fields:

- `audio` (binary, required)
- `language` (string, optional; omit to auto-detect)
- `ignore_timestamps` (bool, default `true`; set `false` to get per-segment timestamps — adds latency on clips < 30 s)

Response (200):

```json
{
  "text": "full transcript",
  "duration": 12.34,
  "segments": [{"text": "...", "start": 0.0, "end": 1.23}]
}
```
### curl

```bash
curl --request POST https://api.fish.audio/v1/asr \
  --header "Authorization: Bearer $FISH_API_KEY" \
  --form "audio=@input.wav" \
  --form "language=en" \
  --form "ignore_timestamps=false"
```
### Python

```python
import os, httpx

with open("input.wav", "rb") as f:
    r = httpx.post(
        "https://api.fish.audio/v1/asr",
        headers={"Authorization": f"Bearer {os.environ['FISH_API_KEY']}"},
        files={"audio": f},
        data={"language": "en", "ignore_timestamps": "false"},
        timeout=120,
    )
r.raise_for_status()
print(r.json()["text"])
```
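The endpoint also accepts `application/msgpack`. A hedged sketch, assuming the msgpack body mirrors the form fields one-to-one (`audio` as raw bytes, plus optional `language` and `ignore_timestamps`); `build_asr_body` and `transcribe` are hypothetical helper names:

```python
import os

def build_asr_body(audio_bytes, language=None, ignore_timestamps=True):
    # Mirror the multipart form fields in a msgpack map; omit `language`
    # entirely to let the server auto-detect.
    body = {"audio": audio_bytes, "ignore_timestamps": ignore_timestamps}
    if language is not None:
        body["language"] = language
    return body

def transcribe(path, language=None):
    import httpx, msgpack  # deferred so the builder stays dependency-free
    with open(path, "rb") as f:
        body = build_asr_body(f.read(), language=language)
    r = httpx.post(
        "https://api.fish.audio/v1/asr",
        headers={
            "Authorization": f"Bearer {os.environ['FISH_API_KEY']}",
            "Content-Type": "application/msgpack",
        },
        content=msgpack.packb(body, use_bin_type=True),
        timeout=120,
    )
    r.raise_for_status()
    return r.json()
```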
## Voice models — /model

### List: GET /model

Query params: `page_size` (default 10), `page_number` (default 1), `title`, `tag` (string or array), `self` (bool — only your models), `author_id`, `language`, `title_language`, `sort_by` (`score | task_count | created_at`, default `score`).

Returns `{total, items: ModelEntity[]}`.
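As a sketch of the query-param rules above (leave unused filters out so the server applies its own defaults); `list_params` and `list_models` are hypothetical helper names:

```python
import os

def list_params(title=None, self_only=False, page_size=10, page_number=1,
                sort_by="score"):
    # Build GET /model query params; omit empty filters entirely.
    params = {"page_size": page_size, "page_number": page_number,
              "sort_by": sort_by}
    if title:
        params["title"] = title
    if self_only:
        params["self"] = "true"
    return params

def list_models(**kwargs):
    import httpx  # deferred so list_params stays dependency-free
    r = httpx.get(
        "https://api.fish.audio/model",
        headers={"Authorization": f"Bearer {os.environ['FISH_API_KEY']}"},
        params=list_params(**kwargs),
    )
    r.raise_for_status()
    return r.json()  # {"total": ..., "items": [...]}
```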
### Create: POST /model (multipart/form-data)

Required: `type=tts`, `title`, `train_mode=fast`, `voices` (one or more audio file uploads).

Optional: `visibility` (`public | unlist | private`, default `public`; `cover_image` is required if public), `description`, `cover_image`, `texts` (transcripts matching each voice; if omitted, ASR is run on the audio), `tags` (string or array), `enhance_audio_quality` (bool, default `false`).
```bash
curl --request POST https://api.fish.audio/model \
  --header "Authorization: Bearer $FISH_API_KEY" \
  --form "type=tts" \
  --form "train_mode=fast" \
  --form "title=My Voice" \
  --form "visibility=private" \
  --form "voices=@sample1.wav" \
  --form "voices=@sample2.wav" \
  --form "texts=Transcript of sample 1." \
  --form "texts=Transcript of sample 2." \
  --form "tags=en" \
  --form "tags=narration"
```
Returns 201 with the full `ModelEntity` including `_id`, `state` (`created | training | trained | failed`), `visibility`, `samples`, `author`, counts, timestamps. Use `_id` as `reference_id` in `/v1/tts`.
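Since a fresh model may report `state: created` or `training` before it is usable, a small poll loop over `GET /model/{id}` can wait for `trained`. A sketch under that assumption; `next_action` and `wait_until_trained` are hypothetical helper names:

```python
import os, time

def next_action(state):
    # Map ModelEntity.state to what a poller should do next.
    if state == "trained":
        return "done"
    if state == "failed":
        return "error"
    return "wait"  # "created" or "training"

def wait_until_trained(model_id, poll_seconds=5, max_wait=600):
    import httpx  # deferred so next_action stays dependency-free
    headers = {"Authorization": f"Bearer {os.environ['FISH_API_KEY']}"}
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        r = httpx.get(f"https://api.fish.audio/model/{model_id}",
                      headers=headers)
        r.raise_for_status()
        action = next_action(r.json()["state"])
        if action == "done":
            return True
        if action == "error":
            raise RuntimeError(f"model {model_id} failed to train")
        time.sleep(poll_seconds)
    return False  # still training after max_wait
```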
### Get / Update / Delete

- `GET /model/{id}` → `ModelEntity`
- `PATCH /model/{id}` — JSON, form-urlencoded, multipart, or msgpack. Nullable fields: `title`, `description`, `cover_image` (binary), `visibility`, `tags`.
- `DELETE /model/{id}` → 200 on success.
```bash
curl --request PATCH https://api.fish.audio/model/<id> \
  --header "Authorization: Bearer $FISH_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{"title": "Renamed", "visibility": "unlist"}'
```
## Wallet

- `GET /wallet/self/package` → `{user_id, type, total, balance, created_at, updated_at, finished_at}`
- `GET /wallet/self/api-credit` → `{_id, user_id, credit, created_at, updated_at, has_phone_sha256, has_free_credit}`. Pass `?check_free_credit=true` to also populate `has_free_credit` (default `false` — the field is `null` when not checked).

Replace `self` with a specific `user_id` if you have permission; otherwise always use `self`.
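A minimal pre-flight credit check built on `GET /wallet/self/api-credit`, useful before a large TTS batch. `has_credit` coerces `credit` with `float()` in case it arrives as a string-encoded decimal (an assumption, not confirmed by the spec); helper names are hypothetical:

```python
import os

def has_credit(api_credit_json, minimum=0.0):
    # api_credit_json is the /wallet/self/api-credit response body.
    # float() handles both numeric and string-encoded `credit` values.
    return float(api_credit_json["credit"]) > minimum

def fetch_api_credit():
    import httpx  # deferred so has_credit stays dependency-free
    r = httpx.get(
        "https://api.fish.audio/wallet/self/api-credit",
        headers={"Authorization": f"Bearer {os.environ['FISH_API_KEY']}"},
    )
    r.raise_for_status()
    return r.json()
```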
## WebSocket TTS — wss://api.fish.audio/v1/tts/live

For low-latency / streaming TTS (e.g. LLM token stream → speech). All frames are MessagePack-encoded binary messages.

### Connection headers

- `Authorization: Bearer <FISH_API_KEY>`
- `model: s2-pro` (or `s1`) — required
### Event sequence

Client → server:

- `StartEvent` — once, first message: `{event: "start", request: <TTSRequest>}`. The `request` object is the same schema as `POST /v1/tts` above. Usually `request.text = ""` and the real text streams in `TextEvent`s.
- `TextEvent` — one per text chunk: `{event: "text", text: "..."}`. Send as many as needed.
- `FlushEvent` — optional: `{event: "flush"}`. Forces the server to synthesize buffered text immediately (use for turn-taking / low-latency flushes).
- `CloseEvent` — final: `{event: "stop"}`. Note the literal is `stop`, not `close`.

Server → client:

- `AudioEvent`: `{event: "audio", audio: <bytes>}` — many of these; concatenate in order to reconstruct the audio stream in the format set by `request.format`.
- `FinishEvent`: `{event: "finish", reason: "stop" | "error"}` — exactly one, then the server closes the socket. Ignore unknown events for forward compatibility.
### Python example (websockets>=14 + msgpack)

`additional_headers` is the parameter name in `websockets` v14+. On older releases use `extra_headers=` or import from `websockets.legacy.client`.

```python
import asyncio, os, msgpack, websockets
from websockets.exceptions import ConnectionClosed

API_KEY = os.environ["FISH_API_KEY"]
URL = "wss://api.fish.audio/v1/tts/live"

start = {
    "event": "start",
    "request": {
        "text": "",
        "reference_id": "<voice-model-id>",
        "format": "mp3",
        "latency": "normal",
    },
}

async def run(text_stream):
    headers = {"Authorization": f"Bearer {API_KEY}", "model": "s2-pro"}
    async with websockets.connect(URL, additional_headers=headers,
                                  max_size=None) as ws:
        await ws.send(msgpack.packb(start, use_bin_type=True))

        async def sender():
            try:
                async for chunk in text_stream:
                    await ws.send(msgpack.packb(
                        {"event": "text", "text": chunk}, use_bin_type=True))
                await ws.send(msgpack.packb({"event": "stop"}, use_bin_type=True))
            except ConnectionClosed:
                pass  # server sent finish before the text stream drained

        send_task = asyncio.create_task(sender())
        try:
            with open("out.mp3", "wb") as f:
                async for raw in ws:
                    msg = msgpack.unpackb(raw, raw=False)
                    if msg["event"] == "audio":
                        f.write(msg["audio"])
                    elif msg["event"] == "finish":
                        if msg["reason"] == "error":
                            raise RuntimeError("TTS failed")
                        break
        finally:
            send_task.cancel()
            try:
                await send_task
            except (asyncio.CancelledError, ConnectionClosed):
                pass

async def words():
    for w in ["Hello", " from", " Fish", " Audio."]:
        yield w

asyncio.run(run(words()))
```
### Node.js example (ws + @msgpack/msgpack)

```js
import WebSocket from "ws";
import { encode, decode } from "@msgpack/msgpack";
import { createWriteStream } from "node:fs";

const ws = new WebSocket("wss://api.fish.audio/v1/tts/live", {
  headers: {
    Authorization: `Bearer ${process.env.FISH_API_KEY}`,
    model: "s2-pro",
  },
});

const out = createWriteStream("out.mp3");

ws.on("open", () => {
  ws.send(encode({
    event: "start",
    request: { text: "", reference_id: "<voice-model-id>", format: "mp3" },
  }));
  ws.send(encode({ event: "text", text: "Hello from Fish Audio." }));
  ws.send(encode({ event: "stop" }));
});

ws.on("message", (buf) => {
  const msg = decode(buf);
  if (msg.event === "audio") out.write(Buffer.from(msg.audio));
  else if (msg.event === "finish") {
    out.end();
    ws.close();
    if (msg.reason === "error") throw new Error("TTS failed");
  }
});
```
## Emotion / expression control

The S1 model uses `(parenthesis)` tags inside `text`, e.g. `(happy) What a day!`. S2-Pro uses free-form `[bracket]` natural-language tags, e.g. `[slightly sarcastic, rising tone]`. Either works through `text` — no separate parameter. Full list: https://docs.fish.audio/api-reference/emotion-reference.md.
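A sketch of the two tag syntaxes in one helper; `tag_emotion` is a hypothetical name and the emotion strings are illustrative, not an exhaustive list:

```python
def tag_emotion(text, emotion, model="s2-pro"):
    # S1 wants (parenthesis) tags; S2-Pro wants [bracket] free-form tags.
    # Either way the tag travels inside the `text` field itself.
    if model == "s1":
        return f"({emotion}) {text}"
    return f"[{emotion}] {text}"
```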
## Encoding and content-type rules

- Use `application/json` for normal TTS requests — it's the simplest and works for `reference_id` flows.
- Use `application/msgpack` when you need to send raw audio bytes inline (inline `references`, or the WebSocket protocol).
- Use `multipart/form-data` for `/v1/asr` and `POST /model` because they upload files.
- All WebSocket frames are MessagePack binary, regardless of inner payload.
## Error handling checklist

- 401 → missing / bad `Authorization` header.
- 402 → out of credit. Check `/wallet/self/api-credit`.
- 404 → bad `model/{id}` (voice model doesn't exist or isn't visible to you).
- 422 → validation. The response is an array; each item's `loc` points at the offending field. Most common causes:
  - `model` header missing on `/v1/tts` or WebSocket.
  - `reference_id` is an array but model is `s1` (multi-speaker requires `s2-pro`).
  - `references` sent with `Content-Type: application/json` (must be msgpack).
  - Numeric param out of range (`temperature`, `top_p`, `chunk_length`, `min_chunk_length`, `prosody.speed`, `early_stop_threshold`).
  - `mp3_bitrate` / `opus_bitrate` set without matching `format`.
- WebSocket: a `finish` event with `reason: "error"` means the server failed mid-stream — surface the message and reconnect rather than retrying on the same socket.
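The two error shapes (a `{status, message}` object for 401/402/404, a list of validation items for 422) can be normalized into one message. A sketch, with `summarize_error` as a hypothetical name:

```python
def summarize_error(status_code, body):
    # 422 bodies are a list of validation items; 401/402/404 bodies are
    # a single {status, message} object.
    if status_code == 422 and isinstance(body, list):
        return "; ".join(
            f"{'.'.join(str(part) for part in item.get('loc', []))}: "
            f"{item.get('msg', '')}"
            for item in body
        )
    if isinstance(body, dict):
        return body.get("message", str(body))
    return str(body)
```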
## Decision shortcuts

- User just wants audio from text → `POST /v1/tts` with JSON + `reference_id`.
- User has a raw voice clip and wants instant cloning → `POST /v1/tts` with MessagePack + `references`.
- User wants dialogue between multiple speakers → `POST /v1/tts` on `s2-pro` with a `reference_id` array and `<|speaker:N|>` tags.
- User is streaming tokens from an LLM and wants speech to play as it arrives → WebSocket `/v1/tts/live`.
- User wants a persistent custom voice they can reuse → `POST /model` first, then reuse the returned `_id` as `reference_id`.
- User wants a transcript → `POST /v1/asr`.