elevenlabs-stt
Summary
98%+ accurate transcription with speaker diarization, audio event tagging, and word-level forced alignment.
- Supports Scribe v1 and v2 models with auto-detection across 90+ languages
- Capabilities include speaker identification, audio event tagging (laughter, applause, music), and precise word-level timestamps via forced alignment
- Forced alignment enables subtitle generation, lip-sync timing, and karaoke applications by aligning known text to audio
- Requires the inference.sh CLI (infsh) for execution; integrates with video captioning and other audio workflows
SKILL.md
ElevenLabs Speech-to-Text
High-accuracy transcription with Scribe models via inference.sh CLI.

Quick Start
Requires the inference.sh CLI (infsh); see the install instructions.
infsh login
# Transcribe audio
infsh app run elevenlabs/stt --input '{"audio": "https://audio.mp3"}'
Available Models
| Model | ID | Best For |
|---|---|---|
| Scribe v2 | scribe_v2 | Latest, highest accuracy (default) |
| Scribe v1 | scribe_v1 | Stable, proven |
- 98%+ transcription accuracy
- 90+ languages with auto-detection
Examples
Basic Transcription
infsh app run elevenlabs/stt --input '{"audio": "https://meeting-recording.mp3"}'
With Speaker Identification
infsh app run elevenlabs/stt --input '{
"audio": "https://meeting.mp3",
"diarize": true
}'
Audio Event Tagging
Detect laughter, applause, music, and other non-speech events:
infsh app run elevenlabs/stt --input '{
"audio": "https://podcast.mp3",
"tag_audio_events": true
}'
Specify Language
infsh app run elevenlabs/stt --input '{
"audio": "https://spanish-audio.mp3",
"language_code": "spa"
}'
Full Options
infsh app run elevenlabs/stt --input '{
"audio": "https://conference.mp3",
"model": "scribe_v2",
"diarize": true,
"tag_audio_events": true,
"language_code": "eng"
}'
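When calls are scripted rather than typed, the input JSON can be assembled programmatically so that optional fields are sent only when set. The sketch below is a hypothetical helper, not part of the infsh tooling: the name `build_stt_command` and its signature are assumptions, while the field names (`audio`, `model`, `diarize`, `tag_audio_events`, `language_code`) are the ones used in the examples above. Omitting `language_code` leaves language detection automatic.

```python
import json


def build_stt_command(audio_url, model=None, diarize=False,
                      tag_audio_events=False, language_code=None):
    """Return an argv list for `infsh app run elevenlabs/stt` (hypothetical helper)."""
    payload = {"audio": audio_url}
    # Only include optional fields the caller actually set.
    if model:
        payload["model"] = model
    if diarize:
        payload["diarize"] = True
    if tag_audio_events:
        payload["tag_audio_events"] = True
    if language_code:
        payload["language_code"] = language_code
    return ["infsh", "app", "run", "elevenlabs/stt",
            "--input", json.dumps(payload)]


cmd = build_stt_command(
    "https://conference.mp3",
    model="scribe_v2",
    diarize=True,
    tag_audio_events=True,
    language_code="eng",
)
print(" ".join(cmd[:5]), cmd[5])
```

With infsh installed and logged in, the list could be handed to `subprocess.run(cmd, check=True)`; passing an argv list avoids shell quoting problems with the JSON string.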
Forced Alignment
Get precise word-level and character-level timestamps by aligning known text to audio. Useful for subtitles, lip-sync, and karaoke.
infsh app run elevenlabs/forced-alignment --input '{
"audio": "https://narration.mp3",
"text": "This is the exact text spoken in the audio file."
}'
Output Format
{
"words": [
{"text": "This", "start": 0.0, "end": 0.3},
{"text": "is", "start": 0.35, "end": 0.5},
{"text": "the", "start": 0.55, "end": 0.65}
],
"text": "This is the exact text spoken in the audio file."
}
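As an illustration of how this output could feed subtitle generation, the sketch below groups the `words` array into SRT cues. It assumes the response has exactly the shape shown above (a `words` list with `text`, `start`, and `end` in seconds); the function names and the `max_words` grouping rule are illustrative choices, not part of the app.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group aligned words into numbered SRT cues of at most max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start"])   # cue starts with its first word
        end = srt_timestamp(chunk[-1]["end"])      # and ends with its last word
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


alignment = {
    "words": [
        {"text": "This", "start": 0.0, "end": 0.3},
        {"text": "is", "start": 0.35, "end": 0.5},
        {"text": "the", "start": 0.55, "end": 0.65},
    ]
}
print(words_to_srt(alignment["words"]))
```

A fixed word count per cue is the simplest grouping; splitting on punctuation or on pauses between words would usually read better on screen.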
Forced Alignment Use Cases
- Subtitles: Precise timing for video captions
- Lip-sync: Align audio to animated characters
- Karaoke: Word-by-word timing for lyrics
- Accessibility: Synchronized transcripts
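For the karaoke and lip-sync cases, a player typically needs to know which word is being spoken at a given playback time. A minimal sketch, assuming the `words` list follows the output format above and is sorted by `start`; `active_word_index` is a hypothetical helper name:

```python
from bisect import bisect_right


def active_word_index(words, t):
    """Return the index of the word spoken at time t (seconds), or None in a gap."""
    starts = [w["start"] for w in words]
    # Last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < words[i]["end"]:
        return i
    return None


words = [
    {"text": "This", "start": 0.0, "end": 0.3},
    {"text": "is", "start": 0.35, "end": 0.5},
    {"text": "the", "start": 0.55, "end": 0.65},
]
idx = active_word_index(words, 0.4)
print(words[idx]["text"] if idx is not None else "(silence)")
```

Binary search keeps each lookup logarithmic in the number of words; for long tracks, the `starts` list could be built once and reused across lookups.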
Workflow: Video Subtitles
# 1. Transcribe video audio
infsh app run elevenlabs/stt --input '{
"audio": "https://video.mp4",
"diarize": true
}' > transcript.json
# 2. Use transcript for captions
infsh app run infsh/caption-videos --input '{
"video_url": "https://video.mp4",
"captions": "<transcript-from-step-1>"
}'
Supported Languages
90+ languages including: English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Turkish, Dutch, Swedish, and many more. Leave language_code empty for automatic detection.
Use Cases
- Meetings: Transcribe recordings with speaker identification
- Podcasts: Generate transcripts with audio event tags
- Subtitles: Create timed captions for videos
- Research: Interview transcription with diarization
- Accessibility: Make audio content searchable and accessible
- Lip-sync: Forced alignment for animation timing
Related Skills
# ElevenLabs TTS (reverse direction)
npx skills add inference-sh/skills@elevenlabs-tts
# ElevenLabs dubbing (translate audio)
npx skills add inference-sh/skills@elevenlabs-dubbing
# Other STT models (Whisper)
npx skills add inference-sh/skills@speech-to-text
# Full platform skill (all 150+ apps)
npx skills add inference-sh/skills@infsh-cli
Browse all audio apps: infsh app list --category audio
Repository
inferen-sh/skills