assemblyai

Installation
SKILL.md

AssemblyAI Speech-to-Text and Voice AI

AssemblyAI provides speech-to-text APIs, audio intelligence models, and an LLM Gateway for applying language models to transcripts. This skill corrects common mistakes that training data gets wrong — deprecated APIs, discontinued SDKs, and non-obvious auth patterns.

Authentication

All endpoints use the same header:

Authorization: YOUR_API_KEY

NOT Authorization: Bearer ... — just the raw API key, no Bearer prefix. This is the #1 mistake.

Base URLs

Service US EU
REST API https://api.assemblyai.com https://api.eu.assemblyai.com
LLM Gateway https://llm-gateway.assemblyai.com/v1 https://llm-gateway.eu.assemblyai.com/v1
Streaming v3 wss://streaming.assemblyai.com/v3/ws wss://streaming.eu.assemblyai.com/v3/ws
Streaming v2 (legacy) wss://api.assemblyai.com/v2/realtime/ws
Voice Agent API wss://agents.assemblyai.com/v1/ws

SDKs

Language Package Status
Python pip install assemblyai Active
JavaScript/TypeScript npm i assemblyai Active
Ruby assemblyai gem Active
Java assemblyai-java-sdk Discontinued April 2025
Go assemblyai-go-sdk Discontinued April 2025
C# .NET AssemblyAI NuGet Discontinued April 2025

Only Python, JS/TS, and Ruby SDKs are maintained. For Java, Go, or C#, use the REST API directly.

Speech-to-Text Models

Pre-Recorded

Model Languages Best For
Universal-3 Pro 6 (en, es, de, fr, pt, it) Highest accuracy, promptable transcription, keyterms up to 1,000 words
Universal-2 99 Broadest language coverage, keyterms up to 200 words

Use speech_models as a priority list with fallback: ["universal-3-pro", "universal-2"].

Streaming

Model Languages Best For
universal-streaming-english 1 (English) Voice agents, ~300ms latency
universal-streaming-multilingual 6 Per-utterance language detection
whisper-rt 99+ Broadest streaming language support, auto-detect only
u3-rt-pro 6 Voice agents — punctuation-based turn detection, promptable

Medical Mode (Add-On)

domain: "medical-v1" enables Medical Mode — an add-on that improves accuracy for medical terminology (medications, procedures, conditions, dosages). Works with both pre-recorded and streaming models.

  • Pre-recorded: Universal-3 Pro (domain: "medical-v1" in request body), Universal-2
  • Streaming: u3-rt-pro, universal-streaming-english, universal-streaming-multilingual
  • Supported languages: English, Spanish, German, French (4 languages only)
  • Billed as a separate add-on. If used with an unsupported language, the API ignores domain and returns a warning — transcript still completes and you are NOT charged for Medical Mode.

Prompting (Universal-3 Pro only)

Two mutually exclusive customization parameters:

  • prompt (string, up to 1500 words): Natural language instructions for transcription style
  • keyterms_prompt (string[], up to 1000 terms): Domain vocabulary for proper nouns, brands, technical terms

Prompting best practices:

  • Use positive, authoritative instructions — NEVER use negative phrasing ("Don't", "Avoid", "Never") as the model gets confused
  • Limit to 3-6 instructions for best results
  • Prefix critical instructions with "Non-negotiable:" or "Required:"

LeMUR is Deprecated

LeMUR is deprecated (sunset March 31, 2026 — already sunset). Use the LLM Gateway instead. The LLM Gateway is an OpenAI-compatible API. Key difference: you pass transcript text directly in messages (no transcript_ids). Transcribe first, then include transcript.text in your prompt.

See references/llm-gateway.md for models, tool calling, structured outputs, and examples.

Key Gotchas

Gotcha Details
prompt + keyterms_prompt Mutually exclusive — use one or the other
summarization / auto_chapters Deprecated. Use LLM Gateway instead (transcribe → send text to LLM)
PII redaction scope Only redacts words in text — other feature outputs (entities, summaries) may still expose sensitive data
Upload key scoping Files uploaded with one API key project cannot be transcribed with a different project's key
Structured outputs Supported by OpenAI, Gemini, Claude 4.5+, Qwen, and Kimi — Claude 3.x does NOT support json_schema structured outputs
U3 Pro turn detection Uses punctuation (. ? !), NOT confidence thresholds — end_of_turn_confidence_threshold has no effect
Negative prompts Never use "Don't" or "Avoid" in prompts — rephrase as positive instructions
PII audio redaction method override_audio_redaction_method: "silence" replaces PII with silence instead of default beep
Language detection Requires minimum 15 seconds of spoken audio for reliable results
LLM Gateway EU region Only Anthropic Claude and Google Gemini models available — OpenAI models are NOT supported in EU
Disfluencies disfluencies: true works on Universal-2 only; for U3 Pro, use prompting instead
Medical Mode unsupported language API silently skips Medical Mode and does not charge for it — check for warning in response
Voice Agent API URL The Voice Agent endpoint is wss://agents.assemblyai.com/v1/ws — NOT /v1/voice (renamed April 2026), /v1/realtime (older), or speech-to-speech.us.assemblyai.com (very old)
Voice Agent tool.call field The argument dict is named arguments, not args (renamed April 2026)
Voice Agent turn detection fields Use min_silence (default 600ms) and max_silence (default 1500ms) under session.input.turn_detectionmin_turn_silence/max_turn_silence are the streaming/LiveKit/Pipecat field names, not Voice Agent API

Common Mistakes

Mistake Correction
Authorization: Bearer KEY Authorization: KEY (no Bearer prefix) — BUT the Voice Agent API (agents.assemblyai.com) uses Authorization: Bearer KEY
Using LeMUR API Deprecated. Use LLM Gateway instead
Using summarization or auto_chapters Deprecated. Use LLM Gateway instead (transcribe then summarize via LLM)
LeMUR transcript_ids with LLM Gateway Pass transcript text in messages, not IDs
anthropic/claude-... model IDs No provider prefix: claude-sonnet-4-5-20250929 not anthropic/claude-sonnet-4-5-20250929
Using Java/Go/C# SDKs Discontinued. Use Python, JS/TS, Ruby, or raw API
word_boost parameter Use keyterms_prompt instead
Hardcoding v2 streaming URL v3 (/v3/ws) is current; v2 still works but is legacy
Omitting speech_models / speech_model Required — no default exists. Omitting causes the request to fail. Use ["universal-3-pro", "universal-2"] for pre-recorded, "u3-rt-pro" for streaming
aai.SpeechModel.universal_3_pro in Python SDK Use raw strings: "universal-3-pro", "universal-2" — these enum aliases don't exist in the SDK
S2S session.update without "session" key Must wrap config: {"type":"session.update","session":{...}}
S2S tool schema using {"function":{...}} nesting S2S tools are flat: {"type":"function","name":"...","description":"...","parameters":{...}}
Voice Agent S2S URL Correct URL: wss://agents.assemblyai.com/v1/ws — not /v1/voice (renamed April 2026), /v1/realtime (older), or speech-to-speech.us.assemblyai.com (very old)
Voice Agent tool.call args field Renamed to argumentsevent["arguments"] is the parameter dict
Medical Mode domain: "medical" Correct value is domain: "medical-v1"
LLM Gateway tool result role: "function_call_output" Correct role is "tool" — use {"role": "tool", "tool_call_id": "...", "content": "..."}

Reference Files

Read the relevant reference file based on what the user needs:

File When to read
references/python-sdk.md Python SDK patterns and examples
references/js-sdk.md JavaScript/TypeScript SDK patterns
references/streaming.md Real-time/streaming STT, v3 protocol, temp tokens, error codes
references/voice-agents.md Voice agent integrations: LiveKit, Pipecat, turn detection, latency optimization
references/llm-gateway.md Applying LLMs to transcripts, tool calling, available models
references/speech-understanding.md Translation, speaker identification, custom formatting
references/audio-intelligence.md PII redaction, diarization, summarization, sentiment, chapters
references/api-reference.md Full parameter list, export endpoints, webhooks, upload, PII policies

API Spec Source of Truth

https://github.com/AssemblyAI/assemblyai-api-spec

Installs
106
GitHub Stars
7
First Seen
Mar 27, 2026