Together Audio

Overview

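Use Together AI audio APIs for text-to-speech (TTS) and speech-to-text (STT) work: REST, streaming, or realtime TTS; file transcription, translation, diarization, or timestamps; and realtime STT.

For other work, hand off to a different skill:

Use together-chat-completions for text-only generation.
Use together-video or together-images for visual generation workflows.
Use together-dedicated-endpoints only when the audio model itself must be hosted on dedicated infrastructure.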

REST TTS or streaming TTS:
- Read references/tts-models.md
- Start with scripts/tts_generate.py or scripts/tts_generate.ts
Realtime TTS over WebSocket:
- Read references/tts-models.md
- Start with scripts/tts_websocket.py
File transcription, translation, diarization, or timestamps:
- Read references/stt-models.md
- Start with scripts/stt_transcribe.py or scripts/stt_transcribe.ts
Realtime STT:
- Read references/stt-models.md
- Start with scripts/stt_realtime.py

Confirm whether the task is TTS or STT.
Choose REST, streaming, or realtime transport based on latency and interaction needs.
Pick the model and response format from the relevant reference file.
Start from the matching script instead of rebuilding the request contract from memory.
For Python STT uploads, open audio files in binary mode and pass the file handle rather than a bare path string.
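
The upload rule above can be sketched as a small helper. This is a minimal illustration, not the contents of scripts/stt_transcribe.py: the helper name transcribe_file is hypothetical, the file= and model= keyword names are assumed to match the v2 SDK, and the model id is left to the caller so it can be taken from references/stt-models.md.

```python
from pathlib import Path
from typing import Any


def transcribe_file(client: Any, audio_path: str, model: str) -> Any:
    """Upload one audio file for transcription and return the SDK response.

    `client` is expected to be a Together v2 client. The keyword names
    `file` and `model` are assumptions; check them against the SDK.
    """
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")

    # Open in binary mode and pass the handle itself, not the path string.
    with path.open("rb") as audio_file:
        return client.audio.transcriptions.create(file=audio_file, model=model)
```

The request runs inside the with block, so the handle is still open while the SDK reads it and is closed as soon as the call returns. Passing the client in as an argument also keeps the helper easy to exercise with a stub before any real request is made.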

Python scripts require the Together v2 SDK (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
Use client.audio.speech.create() for TTS.
REST TTS returns a BinaryAPIResponse; call response.write_to_file(path) to save it. Do NOT use stream_to_file (it does not exist on this object).
Streaming TTS (stream=True) returns a Stream of AudioSpeechStreamChunk objects. Iterate chunks, check chunk.type, and decode base64.b64decode(chunk.delta) for audio data. There is no file-writing helper on the stream object.
Use client.audio.transcriptions.create() for transcription and client.audio.translations.create() for translation.
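Realtime APIs are strict about audio format; confirm the expected PCM encoding (sample rate, channel count, and sample width) in the relevant reference file before streaming bytes.
Diarization and word timestamps change the response shape; write code that handles the richer verbose output explicitly instead of assuming the default shape.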
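
The streaming TTS rule above (iterate chunks, check chunk.type, decode chunk.delta) can be sketched as follows. The helper names are hypothetical, and the event name in AUDIO_DELTA_TYPE is an assumption that should be confirmed against references/tts-models.md. The chunks are handled by attribute access only, so the same code runs on the SDK stream or on stand-in objects.

```python
import base64
from typing import Any, Iterable

# Assumed event name for audio-bearing chunks; confirm it before relying on it.
AUDIO_DELTA_TYPE = "conversation.item.audio_output.delta"


def collect_stream_audio(
    chunks: Iterable[Any], delta_type: str = AUDIO_DELTA_TYPE
) -> bytes:
    """Concatenate the decoded audio bytes from a streaming TTS response."""
    audio = bytearray()
    for chunk in chunks:
        # Skip any event that is not an audio delta.
        if getattr(chunk, "type", None) != delta_type:
            continue
        delta = getattr(chunk, "delta", None)
        if delta:
            # Each delta is a base64 string holding a slice of audio.
            audio.extend(base64.b64decode(delta))
    return bytes(audio)


def save_stream_audio(
    chunks: Iterable[Any], out_path: str, delta_type: str = AUDIO_DELTA_TYPE
) -> int:
    """Write the collected audio to out_path and return the byte count."""
    data = collect_stream_audio(chunks, delta_type)
    # The stream object has no file-writing helper, so write the bytes here.
    with open(out_path, "wb") as out_file:
        out_file.write(data)
    return len(data)
```

In real use, chunks would be the object returned by client.audio.speech.create(..., stream=True). Whether the concatenated bytes are directly playable depends on the response format chosen from the reference file.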