Soniox Speech-to-Text
Cloud speech-to-text API with real-time WebSocket streaming and async file transcription. Supports 60+ languages, speaker diarization, live translation, and custom vocabulary context.
API Structure
Two main APIs:
| API | Transport | Model | Use Case |
|---|---|---|---|
| Real-Time | WebSocket wss://stt-rt.soniox.com/transcribe-websocket | stt-rt-v4 | Live audio streaming, token-by-token results |
| Async | REST https://api.soniox.com/v1/ | stt-async-v4 | Pre-recorded files, batch processing |
Authentication: send an Authorization: Bearer <API_KEY> header. For WebSocket connections, the key can instead be passed as the api_key query parameter.
Quick Start — Real-Time WebSocket
- Connect to wss://stt-rt.soniox.com/transcribe-websocket?api_key=YOUR_KEY
- Send JSON config: {"model": "stt-rt-v4", "audio_format": "pcm_s16le", "sample_rate": 16000}
- Stream raw audio bytes
- Receive JSON tokens: {"tokens": [{"text": "hello", "is_final": true, "start_ms": 100, "end_ms": 500}]}
- Close connection when done
Key token fields: text, is_final (false=provisional, true=confirmed), start_ms, end_ms, confidence, speaker (if diarization enabled), language (if language ID enabled).
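The split between provisional and confirmed tokens can be sketched as follows. This is a minimal illustration, assuming each message is a JSON object with a tokens array shaped like the example above, that every message carries the full current set of provisional tokens (so older provisional tokens are replaced, not extended), and that token text already includes any leading spaces. The helper names apply_message and render are hypothetical and not part of the Soniox API.

```python
import json


def apply_message(final_tokens, raw_message):
    """Fold one server message into the transcript state.

    Confirmed tokens (is_final=true) are appended to final_tokens in place.
    The provisional tokens of this message are returned; the caller should
    replace its previous provisional list with them (assumption, see lead-in).
    """
    message = json.loads(raw_message)
    provisional = []
    for token in message.get("tokens", []):
        if token.get("is_final"):
            final_tokens.append(token)
        else:
            provisional.append(token)
    return provisional


def render(final_tokens, provisional):
    """Join token texts as-is (assumes tokens carry their own spacing)."""
    return "".join(t["text"] for t in final_tokens + provisional)


final = []
pending = apply_message(
    final,
    '{"tokens": [{"text": "hello", "is_final": true, "start_ms": 100, "end_ms": 500},'
    ' {"text": " wor", "is_final": false}]}',
)
print(render(final, pending))  # hello wor

pending = apply_message(final, '{"tokens": [{"text": " world", "is_final": true}]}')
print(render(final, pending))  # hello world
```

Keeping confirmed and provisional tokens in separate lists means a UI can show provisional text in a lighter style and simply overwrite it on the next message.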
Quick Start — Async REST
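# Upload and transcribe
curl -X POST https://api.soniox.com/v1/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F model=stt-async-v4 \
  -F audio_file=@recording.mp3
# Poll for result (set TRANSCRIPTION_ID to the id of the transcription created above;
# a literal {id} in the URL would be treated by curl as a glob pattern)
curl "https://api.soniox.com/v1/transcriptions/$TRANSCRIPTION_ID" \
  -H "Authorization: Bearer $API_KEY"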
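The poll step can be wrapped in a small loop. The sketch below uses only the Python standard library and the GET endpoint shown above. The response field "status" and its values "completed" and "error" are assumptions that this document does not confirm, so check them against references/async-api.md before relying on them. The function names are hypothetical.

```python
import json
import time
import urllib.request


def make_fetcher(transcription_id, api_key, base_url="https://api.soniox.com/v1"):
    """Return a function that GETs the transcription and decodes the JSON body."""
    def fetch():
        request = urllib.request.Request(
            f"{base_url}/transcriptions/{transcription_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch


def poll_until_done(fetch_status, interval_s=2.0, timeout_s=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until a terminal state is reported or time runs out.

    The "status" key and the "completed"/"error" values are assumed names.
    """
    deadline = clock() + timeout_s
    while True:
        body = fetch_status()
        status = body.get("status")
        if status == "completed":
            return body
        if status == "error":
            raise RuntimeError(f"transcription failed: {body}")
        if clock() >= deadline:
            raise TimeoutError("transcription did not finish in time")
        sleep(interval_s)


# Usage (performs real HTTP requests, so it is left commented out):
# import os
# fetch = make_fetcher("YOUR_TRANSCRIPTION_ID", os.environ["API_KEY"])
# result = poll_until_done(fetch)
```

Injecting the fetch function keeps the loop independent of the HTTP client. A 2-second interval issues about 30 requests per minute for a single job.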
Configuration Options (Both APIs)
Common parameters sent in start config (real-time) or request body (async):
| Parameter | Type | Description |
|---|---|---|
| model | string | stt-rt-v4 or stt-async-v4 |
| language_hints | string[] | ISO 639-1 codes to improve accuracy |
| language_hints_strict | bool | Restrict recognition to hinted languages |
| enable_language_identification | bool | Detect language per token |
| enable_speaker_diarization | bool | Label speakers (up to 15) |
| translation | object | Translation config: {"type": "one_way", "target_language": "fr"} or {"type": "two_way", "language_a": "en", "language_b": "fr"} |
| context | object | Domain context (see below) |
| max_endpoint_delay_ms | int | 500-3000 ms, semantic endpoint detection (real-time only) |
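As an illustration of how these parameters combine, here is a minimal sketch that assembles a real-time start config and rejects an out-of-range max_endpoint_delay_ms before anything is sent. The function name build_realtime_config and its keyword arguments are hypothetical. The JSON keys and the 500-3000 range come from the table above, and the audio settings from the Quick Start.

```python
import json


def build_realtime_config(language_hints=None, strict_hints=False,
                          language_identification=False, diarization=False,
                          translation=None, max_endpoint_delay_ms=None):
    """Return the start config as a JSON string for the real-time API."""
    config = {
        "model": "stt-rt-v4",
        "audio_format": "pcm_s16le",
        "sample_rate": 16000,
    }
    if language_hints:
        config["language_hints"] = list(language_hints)
        if strict_hints:
            config["language_hints_strict"] = True
    if language_identification:
        config["enable_language_identification"] = True
    if diarization:
        config["enable_speaker_diarization"] = True
    if translation is not None:
        config["translation"] = translation
    if max_endpoint_delay_ms is not None:
        # Range taken from the parameter table; checked client-side here.
        if not 500 <= max_endpoint_delay_ms <= 3000:
            raise ValueError("max_endpoint_delay_ms must be between 500 and 3000")
        config["max_endpoint_delay_ms"] = max_endpoint_delay_ms
    return json.dumps(config)


print(build_realtime_config(
    language_hints=["en", "fr"],
    diarization=True,
    translation={"type": "one_way", "target_language": "fr"},
    max_endpoint_delay_ms=1000,
))
```

Optional keys are omitted when unset, so the server's defaults apply to anything the caller does not mention.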
Context Object Format
{
  "context": {
    "general": [
      {"key": "domain", "value": "Healthcare"},
      {"key": "topic", "value": "Medical Consultation"}
    ],
    "text": "Background: Patient discussing cardiac symptoms...",
    "terms": ["myocardial infarction", "stent", "angioplasty"],
    "translation_terms": [
      {"source": "stent", "target": "стент"}
    ]
  }
}
Maximum context size: 8000 tokens.
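Because general and translation_terms are lists of small objects, not plain maps, a helper that converts ordinary dictionaries into that shape can reduce mistakes. The sketch below is one way to do it. The name build_context is hypothetical, and it does not enforce the token limit, since that would need the service's tokenizer.

```python
import json


def build_context(general=None, text=None, terms=None, translation_terms=None):
    """Build the context object from plain Python values.

    general: dict of key -> value pairs, e.g. {"domain": "Healthcare"}
    translation_terms: dict of source term -> target term
    """
    context = {}
    if general:
        context["general"] = [{"key": k, "value": v} for k, v in general.items()]
    if text:
        context["text"] = text
    if terms:
        context["terms"] = list(terms)
    if translation_terms:
        context["translation_terms"] = [
            {"source": source, "target": target}
            for source, target in translation_terms.items()
        ]
    return {"context": context}


payload = build_context(
    general={"domain": "Healthcare", "topic": "Medical Consultation"},
    terms=["myocardial infarction", "stent", "angioplasty"],
    translation_terms={"stent": "стент"},
)
# ensure_ascii=False keeps non-ASCII terms readable in the serialized JSON.
print(json.dumps(payload, ensure_ascii=False, indent=2))
```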
Reference Files
Read these based on the specific task:
| File | When to Read |
|---|---|
| references/realtime.md | WebSocket protocol details, token streaming, finalization, keepalive, error codes |
| references/async-api.md | REST endpoints, file upload, job polling, webhooks, file management |
| references/features.md | Languages list, diarization details, context format, models, timestamps |
| references/sdks.md | Python/Node/Web SDK usage, code patterns, client initialization |
| references/integrations.md | Direct/Proxy stream patterns, Vercel AI, TanStack, Twilio, n8n, data residency, security |
Native Swift/macOS Integration
Soniox has no native Swift SDK. For macOS/iOS apps, connect via raw WebSocket:
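import Foundation

// Placeholders for illustration: supply your own key and captured audio.
let apiKey = "YOUR_KEY"
let audioBuffer = Data() // raw pcm_s16le bytes from the microphone

// URLSessionWebSocketTask approach
let url = URL(string: "wss://stt-rt.soniox.com/transcribe-websocket?api_key=\(apiKey)")!
let task = URLSession.shared.webSocketTask(with: url)
task.resume()

// Send start config
let config = """
{"model":"stt-rt-v4","audio_format":"pcm_s16le","sample_rate":16000}
"""
task.send(.string(config)) { error in
    if let error = error { print("Config send failed: \(error)") }
}

// Stream audio bytes from microphone
task.send(.data(audioBuffer)) { error in
    if let error = error { print("Audio send failed: \(error)") }
}

// Receive tokens
func receiveNext() {
    task.receive { result in
        switch result {
        case .success(.string(let json)):
            // Parse tokens from JSON
            print(json)
            receiveNext() // Continue receiving
        case .success:
            // Ignore non-text frames and keep listening
            receiveNext()
        case .failure(let error):
            // Stop receiving: re-arming after a failure would loop forever
            print("Receive failed: \(error)")
        }
    }
}
receiveNext() // Start the receive loop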
Audio format: Send raw PCM signed 16-bit little-endian at 16kHz mono for best results. The API also auto-detects encoded formats (mp3, ogg, flac, wav, etc.).
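For reference, the recommended raw format works out to 16000 samples/s × 2 bytes = 32000 bytes per second of mono audio. The sketch below converts floating-point samples in the range [-1.0, 1.0] to pcm_s16le and splits the result into fixed-duration chunks for streaming. The 100 ms chunk size and both function names are assumptions chosen for illustration, not values required by the API.

```python
import struct

SAMPLE_RATE = 16000      # Hz, mono
BYTES_PER_SAMPLE = 2     # signed 16-bit


def floats_to_pcm_s16le(samples):
    """Convert float samples in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_audio(pcm, chunk_ms=100):
    """Yield consecutive chunks of roughly chunk_ms milliseconds each."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


one_second_of_silence = floats_to_pcm_s16le([0.0] * SAMPLE_RATE)
chunks = list(chunk_audio(one_second_of_silence))
print(len(one_second_of_silence), len(chunks), len(chunks[0]))  # 32000 10 3200
```

Each chunk would then be sent as one binary WebSocket message.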
Rate Limits
| Limit | Real-Time | Async |
|---|---|---|
| Requests/min | 100 | 100 |
| Concurrent | 10 connections | 100 pending jobs |
| Max duration | 300 min/session | — |
| Storage | — | 10GB, 1000 files |
| Total transcriptions | — | 2000 |
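A client can stay under the requests-per-minute limit with a local guard. The sketch below is a simple sliding-window limiter using the 100 requests/min figure from the table. It is a client-side approximation only: how the service actually counts requests is not stated in this document, and the class name is hypothetical.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_s-second window."""

    def __init__(self, max_requests=100, window_s=60.0, clock=time.monotonic):
        self._max_requests = max_requests
        self._window_s = window_s
        self._clock = clock
        self._stamps = deque()  # timestamps of recently accepted requests

    def try_acquire(self):
        """Return True and record the request if it fits in the window."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._window_s:
            self._stamps.popleft()
        if len(self._stamps) < self._max_requests:
            self._stamps.append(now)
            return True
        return False


limiter = SlidingWindowLimiter()
if limiter.try_acquire():
    pass  # safe to issue the next API request
else:
    pass  # back off and retry later
```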
Data Residency
Regional endpoints available:
| Region | Real-Time Endpoint | Async Endpoint |
|---|---|---|
| US (default) | stt-rt.soniox.com | api.soniox.com |
| EU | stt-rt.eu.soniox.com | api.eu.soniox.com |
| Japan | stt-rt-jp.soniox.com | api.jp.soniox.com |
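Region selection can be reduced to a lookup. The sketch below copies the hostnames verbatim from the table and assumes that regional hosts use the same /transcribe-websocket and /v1/ paths as the default US endpoints, which this document does not state explicitly. The region keys "us", "eu" and "jp" and the function names are hypothetical.

```python
# (real-time host, async host) per region, copied from the table above.
ENDPOINTS = {
    "us": ("stt-rt.soniox.com", "api.soniox.com"),
    "eu": ("stt-rt.eu.soniox.com", "api.eu.soniox.com"),
    "jp": ("stt-rt-jp.soniox.com", "api.jp.soniox.com"),
}


def realtime_url(region="us"):
    """WebSocket URL for the region (path assumed identical across regions)."""
    realtime_host, _ = ENDPOINTS[region]
    return f"wss://{realtime_host}/transcribe-websocket"


def async_base_url(region="us"):
    """REST base URL for the region (path assumed identical across regions)."""
    _, async_host = ENDPOINTS[region]
    return f"https://{async_host}/v1/"


print(realtime_url("eu"))    # wss://stt-rt.eu.soniox.com/transcribe-websocket
print(async_base_url("eu"))  # https://api.eu.soniox.com/v1/
```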