speech-to-text
Transcribe audio and video to text with speaker identification, word-level timestamps, and 90+ language support.
- Two models available: scribe_v2 for batch transcription with high accuracy, and scribe_v2_realtime for live transcription with ~150ms latency
- Speaker diarization labels each word with speaker ID; keyterm prompting helps recognize domain-specific vocabulary and proper nouns
- Word-level timestamps include type classification (word, spacing, audio event) for precise timing and subtitle generation
- Real-time streaming supports partial and committed transcripts with manual or Voice Activity Detection (VAD) commit strategies; client-side React integration available
- Supports MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus audio and MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP video up to 3GB and 10 hours
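As a rough illustration of how word-level timestamps with type classification can drive subtitle generation, here is a minimal sketch that turns a list of word entries into SRT cues. The field names (`text`, `start`, `end`, `type`) and the type value `"word"` are assumptions about the response shape, and `format_srt_time` and `words_to_srt` are hypothetical helpers, not part of the ElevenLabs SDK.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7) -> str:
    """Group spoken words into subtitle cues of at most max_words words each."""
    # Keep only entries classified as words; spacing and audio events are skipped.
    spoken = [w for w in words if w["type"] == "word"]
    cues = []
    for i in range(0, len(spoken), max_words):
        chunk = spoken[i:i + max_words]
        start = format_srt_time(chunk[0]["start"])  # cue starts with its first word
        end = format_srt_time(chunk[-1]["end"])     # and ends with its last word
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + "\n"
```

Splitting on a fixed word count keeps the sketch short; a real subtitle pipeline would more likely break cues on pauses or punctuation.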
ElevenLabs Speech-to-Text
Transcribe audio to text with Scribe v2 - supports 90+ languages, speaker diarization, and word-level timestamps.
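Because diarization labels each word with a speaker ID, a transcript can be regrouped into speaker turns. The sketch below assumes each entry is a dict with `text` and `speaker_id` keys and that spacing entries carry their own whitespace text; both are assumptions about the response shape, and `speaker_turns` is a hypothetical helper.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def speaker_turns(words: List[Dict]) -> List[Tuple[str, str]]:
    """Merge consecutive entries from the same speaker into (speaker, text) turns."""
    turns = []
    # groupby only merges adjacent entries, so the original speaking order is kept.
    for speaker, group in groupby(words, key=lambda w: w["speaker_id"]):
        text = "".join(w["text"] for w in group).strip()
        if text:  # skip groups that contain only spacing
            turns.append((speaker, text))
    return turns
```

Each returned tuple can then be printed as a line such as `speaker_0: Hi there`.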
Setup: See Installation Guide. For JavaScript, use @elevenlabs/* packages only.
Quick Start
Python
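from elevenlabs import ElevenLabs

client = ElevenLabs()

# Open the audio file in binary mode and send it for batch transcription.
with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")

# Print the full transcript text.
print(result.text)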
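Before uploading, it can help to check a file against the supported formats and the 3GB size limit locally. The following is a minimal sketch: the extension set is an assumed mapping from the listed format names, "3GB" is read conservatively as decimal gigabytes, and `upload_problems` is a hypothetical helper rather than an SDK function.

```python
from pathlib import Path
from typing import List

# Extensions assumed to correspond to the supported audio and video formats.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".aac", ".aiff", ".opus",
    ".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".mpeg", ".3gp",
}
MAX_BYTES = 3 * 10**9  # "3GB", interpreted as decimal gigabytes (an assumption)


def upload_problems(filename: str, size_bytes: int) -> List[str]:
    """Return a list of reasons the file may be rejected; empty if none were found."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit")
    return problems
```

In practice the size would come from `os.path.getsize(path)`. The 10-hour duration limit is not checked here, since that requires reading the media container.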