speech-to-text

Summary

Transcribe audio and video to text with speaker identification, word-level timestamps, and 90+ language support.

  • Two models available: scribe_v2 for batch transcription with high accuracy, and scribe_v2_realtime for live transcription with ~150ms latency
  • Speaker diarization labels each word with a speaker ID; keyterm prompting helps recognize domain-specific vocabulary and proper nouns
  • Word-level timestamps include type classification (word, spacing, audio event) for precise timing and subtitle generation
  • Real-time streaming supports partial and committed transcripts with manual or Voice Activity Detection (VAD) commit strategies; client-side React integration available
  • Supports MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, and Opus audio, and MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, and 3GPP video, for files up to 3GB and 10 hours long
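
As an illustration of how diarization and word-level timestamps can be combined, the sketch below merges consecutive words from the same speaker into timed turns, which is a common first step toward subtitles. It does not call the API. The shape of each entry (a dict with text, start, end, type, and speaker_id keys) and the type strings "word", "spacing", and "audio_event" are assumptions made for this example, and Turn and group_by_speaker are hypothetical names; check the actual response fields before relying on them.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker_id: str
    start: float  # seconds
    end: float    # seconds
    text: str


def group_by_speaker(words):
    """Merge consecutive entries from the same speaker into turns.

    `words` is assumed to be a list of dicts with the keys
    text, start, end, type, and speaker_id (hypothetical shape).
    """
    turns = []
    for w in words:
        kind = w["type"]
        if kind == "audio_event":
            continue  # skip non-speech markers
        same = bool(turns) and turns[-1].speaker_id == w["speaker_id"]
        if kind == "word":
            if same:
                # Extend the current turn with this word.
                turns[-1].text += w["text"]
                turns[-1].end = w["end"]
            else:
                # A new speaker starts a new turn.
                turns.append(Turn(w["speaker_id"], w["start"], w["end"], w["text"]))
        elif kind == "spacing" and same:
            # Keep spacing inside a turn; drop it at turn boundaries.
            turns[-1].text += w["text"]
    for t in turns:
        t.text = t.text.strip()
    return turns
```

Each returned Turn carries the start of its first word and the end of its last word, so it can be formatted directly as a subtitle cue or a "speaker: text" transcript line.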
SKILL.md

ElevenLabs Speech-to-Text

Transcribe audio to text with Scribe v2, which supports 90+ languages, speaker diarization, and word-level timestamps.

Setup: See Installation Guide. For JavaScript, use @elevenlabs/* packages only.

Quick Start

Python

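from elevenlabs import ElevenLabs

# With no arguments, the client reads the API key from the ELEVENLABS_API_KEY environment variable.
client = ElevenLabs()

# Open the file in binary mode and send it for batch transcription with Scribe v2.
with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")

# result.text holds the full transcript as a single string.
print(result.text)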
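
Before uploading, it may be worth checking a file against the format list and the 3GB limit stated in the summary. The following is a minimal sketch of such a pre-flight check, not part of the SDK: check_upload is a hypothetical helper, the mapping from format names to file extensions (for example 3GPP to .3gp) is an assumption, and 3GB is interpreted here as 3 * 1024**3 bytes. The 10-hour duration limit is not checked because that would require decoding the media.

```python
from pathlib import Path

# Assumed extension mapping for the formats listed in the summary.
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".aac", ".aiff", ".opus"}
VIDEO_EXTS = {".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".webm", ".mpeg", ".3gp"}
MAX_BYTES = 3 * 1024**3  # assumption: "3GB" read as 3 GiB


def check_upload(name, size_bytes):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    ext = Path(name).suffix.lower()
    if ext not in AUDIO_EXTS | VIDEO_EXTS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 3GB")
    return problems
```

For a file on disk, this could be called as check_upload(path.name, path.stat().st_size) with path being a pathlib.Path, before opening the file for convert.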

Installs: 3.5K
GitHub Stars: 232
First Seen: Jan 27, 2026