Soniox Speech-to-Text

Cloud speech-to-text API with real-time WebSocket streaming and async file transcription. Supports 60+ languages, speaker diarization, live translation, and custom vocabulary context.

API Structure

Two main APIs:

API	Transport	Model	Use Case
Real-Time	WebSocket `wss://stt-rt.soniox.com/transcribe-websocket`	`stt-rt-v4`	Live audio streaming, token-by-token results
Async	REST `https://api.soniox.com/v1/`	`stt-async-v4`	Pre-recorded files, batch processing

Authentication: send an `Authorization: Bearer <API_KEY>` header, or pass the key as the `api_key` query parameter when opening a WebSocket connection.
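The two authentication styles can be sketched with the standard library alone. The helper names `rest_headers` and `realtime_url` are hypothetical and used only for illustration. The host and path come from the table above.

```python
from urllib.parse import urlencode, urlunsplit

REALTIME_HOST = "stt-rt.soniox.com"  # real-time host from the table above


def rest_headers(api_key: str) -> dict:
    """Headers for an async REST request (Bearer token)."""
    return {"Authorization": f"Bearer {api_key}"}


def realtime_url(api_key: str, host: str = REALTIME_HOST) -> str:
    """WebSocket URL with the key passed as the api_key query parameter."""
    query = urlencode({"api_key": api_key})  # URL-encodes unsafe characters
    return urlunsplit(("wss", host, "/transcribe-websocket", query, ""))
```

Building the query string with `urlencode` avoids malformed URLs if a key ever contains characters that need escaping.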

Quick Start — Real-Time WebSocket

1. Connect to `wss://stt-rt.soniox.com/transcribe-websocket?api_key=YOUR_KEY`.
2. Send the JSON config: `{"model": "stt-rt-v4", "audio_format": "pcm_s16le", "sample_rate": 16000}`
3. Stream raw audio bytes.
4. Receive JSON tokens: `{"tokens": [{"text": "hello", "is_final": true, "start_ms": 100, "end_ms": 500}]}`
5. Close the connection when done.

Key token fields: text, is_final (false=provisional, true=confirmed), start_ms, end_ms, confidence, speaker (if diarization enabled), language (if language ID enabled).
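A minimal sketch of how a client might fold these messages into a running transcript. It assumes, beyond what is stated here, that each confirmed token is delivered once, that provisional tokens from earlier messages are superseded by the latest message, and that token text carries its own spacing. The `Transcript` class is a hypothetical helper, not part of any SDK.

```python
import json


class Transcript:
    """Accumulates confirmed text and keeps only the latest provisional tail."""

    def __init__(self) -> None:
        self.final_text = ""
        self.provisional_text = ""

    def feed(self, message: str) -> None:
        """Consume one JSON message received from the WebSocket."""
        data = json.loads(message)
        provisional = []
        for token in data.get("tokens", []):
            if token.get("is_final"):
                # Confirmed tokens are appended permanently.
                self.final_text += token["text"]
            else:
                provisional.append(token["text"])
        # Assumption: provisional tokens are replaced on every message.
        self.provisional_text = "".join(provisional)

    @property
    def text(self) -> str:
        """Confirmed text followed by the current provisional tail."""
        return self.final_text + self.provisional_text
```

A UI would typically render `final_text` in its normal style and `provisional_text` in a lighter one, since the latter may still change.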

Quick Start — Async REST

# Upload and transcribe
curl -X POST https://api.soniox.com/v1/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F model=stt-async-v4 \
  -F audio_file=@recording.mp3

# Poll for result (TRANSCRIPTION_ID holds the id returned by the call above)
curl "https://api.soniox.com/v1/transcriptions/$TRANSCRIPTION_ID" \
  -H "Authorization: Bearer $API_KEY"
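The polling step can be wrapped in a small loop. The sketch below is an illustration under assumptions: the `status` field and its `completed` and `error` values are not confirmed by this document, so check them against references/async-api.md. `wait_for_transcription` and `http_fetch` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable


def http_fetch(transcription_id: str, api_key: str) -> dict:
    """GET the transcription resource and decode the JSON body."""
    request = urllib.request.Request(
        f"https://api.soniox.com/v1/transcriptions/{transcription_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_transcription(
    fetch: Callable[[], dict],
    interval_s: float = 1.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the job reaches a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch()
        status = job.get("status")  # assumed field name and values
        if status == "completed":
            return job
        if status == "error":
            raise RuntimeError(f"transcription failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError("transcription did not finish in time")
        sleep(interval_s)
```

A call might look like `wait_for_transcription(lambda: http_fetch(job_id, api_key), interval_s=2.0)`. Passing `fetch` and `sleep` as parameters keeps the loop testable without network access.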

Configuration Options (Both APIs)

Common parameters sent in start config (real-time) or request body (async):

Parameter	Type	Description
`model`	string	`stt-rt-v4` or `stt-async-v4`
`language_hints`	string[]	ISO 639-1 codes to improve accuracy
`language_hints_strict`	bool	Restrict recognition to hinted languages
`enable_language_identification`	bool	Detect language per token
`enable_speaker_diarization`	bool	Label speakers (up to 15)
`translation`	object	Translation config: `{"type": "one_way", "target_language": "fr"}` or `{"type": "two_way", "language_a": "en", "language_b": "fr"}`
`context`	object	Domain context (see below)
`max_endpoint_delay_ms`	int	500-3000ms, semantic endpoint detection (real-time only)
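As an illustration, a start config for the real-time API could be assembled and validated like this. `build_start_config` is a hypothetical helper. The checks mirror only the constraints listed in the table (the 500-3000 ms range and the two translation shapes), and the service may enforce others.

```python
import json
from typing import Optional


def build_start_config(
    model: str = "stt-rt-v4",
    audio_format: str = "pcm_s16le",
    sample_rate: int = 16000,
    language_hints: Optional[list] = None,
    enable_speaker_diarization: bool = False,
    translation: Optional[dict] = None,
    max_endpoint_delay_ms: Optional[int] = None,
) -> str:
    """Return the JSON text to send as the first WebSocket message."""
    config = {"model": model, "audio_format": audio_format, "sample_rate": sample_rate}
    if language_hints:
        config["language_hints"] = list(language_hints)
    if enable_speaker_diarization:
        config["enable_speaker_diarization"] = True
    if translation is not None:
        # Keys required for each translation type, per the table above.
        required = {"one_way": {"target_language"}, "two_way": {"language_a", "language_b"}}
        kind = translation.get("type")
        if kind not in required:
            raise ValueError(f"unknown translation type: {kind!r}")
        missing = required[kind] - set(translation)
        if missing:
            raise ValueError(f"translation is missing keys: {sorted(missing)}")
        config["translation"] = dict(translation)
    if max_endpoint_delay_ms is not None:
        if not 500 <= max_endpoint_delay_ms <= 3000:
            raise ValueError("max_endpoint_delay_ms must be between 500 and 3000")
        config["max_endpoint_delay_ms"] = max_endpoint_delay_ms
    return json.dumps(config)
```

Optional fields are omitted when unset, so the server's defaults apply instead of explicit nulls.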

Context Object Format

{
  "context": {
    "general": [
      {"key": "domain", "value": "Healthcare"},
      {"key": "topic", "value": "Medical Consultation"}
    ],
    "text": "Background: Patient discussing cardiac symptoms...",
    "terms": ["myocardial infarction", "stent", "angioplasty"],
    "translation_terms": [
      {"source": "stent", "target": "стент"}
    ]
  }
}

The context object as a whole is limited to 8000 tokens.
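A sketch of a builder that produces the structure shown above from plain Python values. `build_context` is a hypothetical helper, and it does not enforce the size limit, since counting tokens depends on the service's tokenizer.

```python
from typing import Iterable, Mapping, Optional


def build_context(
    domain: Optional[str] = None,
    topic: Optional[str] = None,
    text: Optional[str] = None,
    terms: Iterable[str] = (),
    translation_terms: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build a {"context": ...} fragment, leaving out empty sections."""
    context = {}
    general = []
    if domain:
        general.append({"key": "domain", "value": domain})
    if topic:
        general.append({"key": "topic", "value": topic})
    if general:
        context["general"] = general
    if text:
        context["text"] = text
    terms = list(terms)
    if terms:
        context["terms"] = terms
    if translation_terms:
        # Map of source term -> preferred translation.
        context["translation_terms"] = [
            {"source": source, "target": target}
            for source, target in translation_terms.items()
        ]
    return {"context": context}
```

The returned dictionary can be merged into a start config or request body before serializing it with `json.dumps`.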

Reference Files

Read these based on the specific task:

File	When to Read
references/realtime.md	WebSocket protocol details, token streaming, finalization, keepalive, error codes
references/async-api.md	REST endpoints, file upload, job polling, webhooks, file management
references/features.md	Languages list, diarization details, context format, models, timestamps
references/sdks.md	Python/Node/Web SDK usage, code patterns, client initialization
references/integrations.md	Direct/Proxy stream patterns, Vercel AI, TanStack, Twilio, n8n, data residency, security

Native Swift/macOS Integration

Soniox has no native Swift SDK. For macOS/iOS apps, connect via raw WebSocket:

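import Foundation

// URLSessionWebSocketTask approach (apiKey is assumed to be defined elsewhere)
let url = URL(string: "wss://stt-rt.soniox.com/transcribe-websocket?api_key=\(apiKey)")!
let task = URLSession.shared.webSocketTask(with: url)
task.resume()

// Send start config as the first text message
let config = """
{"model":"stt-rt-v4","audio_format":"pcm_s16le","sample_rate":16000}
"""
task.send(.string(config)) { error in
    if let error = error { print("config send failed: \(error)") }
}

// Stream audio bytes from microphone (audioBuffer is a Data chunk of PCM samples)
task.send(.data(audioBuffer)) { error in
    if let error = error { print("audio send failed: \(error)") }
}

// Receive tokens
func receiveNext() {
    task.receive { result in
        switch result {
        case .success(.string(let json)):
            // Parse tokens from JSON
            print(json)
        case .success:
            // Ignore binary or unknown message types
            break
        case .failure(let error):
            // The socket is no longer usable, so stop the receive loop
            print("receive failed: \(error)")
            return
        }
        receiveNext() // Continue receiving only after a successful message
    }
}
receiveNext()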

Audio format: For best results, send raw PCM (signed 16-bit little-endian, 16 kHz, mono). The API also auto-detects encoded formats (mp3, ogg, flac, wav, etc.).
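Many audio pipelines produce floating-point samples, so a conversion step is usually needed before streaming. The sketch below converts mono floats in [-1.0, 1.0] to `pcm_s16le` bytes and splits them into chunks. The function names and the 100 ms chunk size are illustrative assumptions, not requirements stated by the API.

```python
import struct
from typing import Iterable, Iterator


def float_to_pcm_s16le(samples: Iterable[float]) -> bytes:
    """Convert mono float samples in [-1.0, 1.0] to signed 16-bit little-endian PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]  # clip, then scale
    return struct.pack(f"<{len(ints)}h", *ints)  # "<" forces little-endian


def chunk_pcm(pcm: bytes, chunk_ms: int = 100, sample_rate: int = 16000) -> Iterator[bytes]:
    """Yield successive chunks of roughly chunk_ms milliseconds (2 bytes per sample)."""
    chunk_bytes = sample_rate * 2 * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

Each yielded chunk can be sent as one binary WebSocket message. At 16 kHz mono, 100 ms corresponds to 3200 bytes.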

Rate Limits

Limit	Real-Time	Async
Requests/min	100	100
Concurrent	10 connections	100 pending jobs
Max duration	300 min/session	—
Storage	—	10GB, 1000 files
Total transcriptions	—	2000
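To stay under the 100 requests/min limit, a client can throttle itself before sending. The sliding-window limiter below is a client-side approximation. How the service actually counts requests is not specified here, and `SlidingWindowLimiter` is a hypothetical helper.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allows at most max_calls acquisitions within any window_s-second span."""

    def __init__(
        self,
        max_calls: int = 100,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock
        self._calls = deque()  # timestamps of recent acquisitions, oldest first

    def try_acquire(self) -> bool:
        """Record a request and return True if it fits in the current window."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_s:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            self._calls.append(now)
            return True
        return False
```

When `try_acquire()` returns `False`, the caller can sleep briefly and retry rather than sending a request that is likely to be rejected.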

Data Residency

Regional endpoints available:

Region	Real-Time Endpoint	Async Endpoint
US (default)	`stt-rt.soniox.com`	`api.soniox.com`
EU	`stt-rt.eu.soniox.com`	`api.eu.soniox.com`
Japan	`stt-rt.jp.soniox.com`	`api.jp.soniox.com`
