together-audio

Together Audio

Overview

Use Together AI audio APIs for:

  • text-to-speech generation
  • streaming or realtime voice output
  • speech-to-text transcription
  • translation, diarization, and timestamps
  • live captioning and realtime transcription

When This Skill Wins

  • Generate spoken audio from text
  • Transcribe uploaded audio files or URLs
  • Add realtime voice or captioning to an app
  • Extract speaker segments or word timings

Hand Off To Another Skill

  • Use together-chat-completions for text-only generation
  • Use together-video or together-images for visual generation workflows
  • Use together-dedicated-endpoints only when the audio model itself must be hosted on dedicated infrastructure

Workflow

  1. Confirm whether the task is text-to-speech (TTS) or speech-to-text (STT).
  2. Choose REST, streaming, or realtime transport based on latency and interaction needs.
  3. Pick the model and response format from the relevant reference file.
  4. Start from the matching script instead of rebuilding the request contract from memory.
  5. For Python STT uploads, open audio files in binary mode and pass the file handle rather than a bare path string.
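
Step 5 above can be sketched as follows. This is a minimal illustration, not the skill's own script: `transcribe_file` is a hypothetical helper name, the `file=` and `model=` keyword arguments are assumed to match the SDK's `client.audio.transcriptions.create()` signature, and the model name is left to the caller because it should come from the relevant reference file.

```python
from pathlib import Path


def transcribe_file(client, audio_path, model):
    """Upload a local audio file for transcription (hypothetical helper).

    `client` is expected to be a Together v2 SDK client. `model` is whatever
    STT model the reference file recommends; it is not hard-coded here.
    """
    path = Path(audio_path)
    # Binary mode matters: text mode would try to decode the audio as text.
    with path.open("rb") as audio_file:
        # Pass the open file handle, not the bare path string.
        # Assumption: create() accepts `file=` and `model=` keyword arguments.
        return client.audio.transcriptions.create(file=audio_file, model=model)
```

The helper takes the client as an argument so that it can be exercised with a stand-in object before any real request is sent. The `with` block also guarantees the handle is closed once the upload call returns.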

High-Signal Rules

  • Python scripts require the Together v2 SDK (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
  • Use client.audio.speech.create() for TTS.
  • REST TTS returns a BinaryAPIResponse; call response.write_to_file(path) to save it. Do NOT use stream_to_file (it does not exist on this object).
  • Streaming TTS (stream=True) returns a Stream of AudioSpeechStreamChunk objects. Iterate chunks, check chunk.type, and decode base64.b64decode(chunk.delta) for audio data. There is no file-writing helper on the stream object.
  • Use client.audio.transcriptions.create() for transcription and client.audio.translations.create() for translation.
  • Realtime APIs are strict about audio format; confirm the expected PCM encoding, sample rate, and channel count before streaming bytes.
  • Diarization and word timestamps change the response shape; write code against the richer verbose output explicitly instead of assuming the plain-text response.
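
The two TTS rules above (REST versus streaming) can be contrasted in a short sketch. It is written under stated assumptions: `save_rest_tts` and `collect_streamed_tts` are hypothetical helper names, the `model=`, `input=`, and `voice=` keyword arguments are assumed to match `client.audio.speech.create()`, and the `chunk.type` value that marks audio data is passed in as `audio_type` because the exact string is not given here and should be checked against the reference file.

```python
import base64


def save_rest_tts(client, text, out_path, model, voice):
    """Non-streaming TTS: one request, one binary response saved to disk."""
    # Assumption: create() accepts model=, input=, and voice= keywords.
    response = client.audio.speech.create(model=model, input=text, voice=voice)
    # write_to_file is the save helper on the binary response
    # (stream_to_file does not exist on this object).
    response.write_to_file(out_path)
    return out_path


def collect_streamed_tts(client, text, model, voice, audio_type):
    """Streaming TTS: gather decoded audio bytes from the chunk stream.

    `audio_type` is the chunk.type value that marks audio data. It is a
    parameter because the exact string is an assumption to verify.
    """
    stream = client.audio.speech.create(
        model=model, input=text, voice=voice, stream=True
    )
    audio = bytearray()
    for chunk in stream:
        if chunk.type != audio_type:
            continue  # skip non-audio events
        # Each audio chunk carries base64-encoded bytes in chunk.delta.
        audio.extend(base64.b64decode(chunk.delta))
    # The stream has no file-writing helper, so the caller writes the bytes.
    return bytes(audio)
```

Returning raw bytes from the streaming helper leaves the choice of sink to the caller: the bytes can be written to a file opened in `"wb"` mode or forwarded to a playback buffer as they arrive.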

More from zainhas/togetherai-skills

First Seen
Feb 27, 2026