transcribe
This skill contains shell command directives (!`command`) that may execute system commands. Review carefully before installing.
Transcribe Skill
Transcribe audio or video to text using OpenAI Whisper locally. Zero config, no API keys, no cloud.
Install
npx skills add agairola/life-skills --skill transcribe
When to Use
Trigger this skill when the user:
- Asks to transcribe an audio or video file
- Says "what did they say in this file", "get a transcript", "convert this to text"
- Mentions speech-to-text, STT, or Whisper
- Wants subtitles, captions, SRT, or VTT from a media file
- Provides a path or URL to an audio/video file and wants the spoken content
Prerequisites
- uv — `brew install uv` (macOS) or `pip install uv`.
- ffmpeg — only needed on Apple Silicon (MLX engine): `brew install ffmpeg`. The script auto-falls back to the CPU engine (no ffmpeg needed) if ffmpeg is missing.
- Models — auto-downloaded on first run (~800MB for the default `large-v3-turbo`, cached in `~/.cache/huggingface/`).
- API keys — none.
- Dependencies — declared inline (PEP 723), installed by `uv run`; a sketch of what that header looks like follows this list.
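For context, a PEP 723 header is a commented TOML block at the top of the script that `uv run` reads before execution. The block below is illustrative, not the actual header of scripts/transcribe.py, and the dependency names are assumptions:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "mlx-whisper",     # assumed: MLX engine used on Apple Silicon
#     "faster-whisper",  # assumed: CPU fallback engine
# ]
# ///

# `uv run` reads the block above, resolves the dependencies into an
# ephemeral environment, and then executes the script -- no venv or
# manual pip install required.
print("transcribe.py would start here")
```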
Setup Status
!`command -v uv > /dev/null 2>&1 && echo "uv: installed" || echo "uv: NOT INSTALLED — run: brew install uv"`
!`command -v ffmpeg > /dev/null 2>&1 && echo "ffmpeg: installed (MLX engine available on Apple Silicon)" || echo "ffmpeg: not installed (will use CPU engine fallback; for best speed on Apple Silicon run: brew install ffmpeg)"`
Command Template
uv run "${CLAUDE_SKILL_DIR}/scripts/transcribe.py" <INPUT> [OPTIONS]
Options
| Flag | Values | Default | Purpose |
|---|---|---|---|
| `input` | file path | required | Path to audio/video file |
| `--model` | `tiny`, `base`, `small`, `medium`, `large-v3`, `large-v3-turbo`, `turbo` | `large-v3-turbo` | Whisper model. `turbo` is the accuracy/speed sweet spot. |
| `--language` | ISO code (`en`, `fr`, `de`, …) or `auto` | `auto` | Spoken language. Auto-detect is reliable but slower. |
| `--format` | `txt`, `srt`, `vtt`, `json` | `txt` | Output format. `json` includes per-segment timestamps. |
| `--output` | file path | (stdout) | Write transcript to file instead of stdout. |
| `--word-timestamps` | flag | off | Include word-level timing (JSON format only). |
| `--engine` | `auto`, `mlx`, `faster` | `auto` | Force an engine. `auto` picks MLX on Apple Silicon, faster-whisper elsewhere. |
| `--translate` | flag | off | Translate to English instead of transcribing in the source language. |
| `--json-summary` | flag | off | Print a JSON summary to stdout with file path + stats. |
The default prints the plain transcript to stdout. Use --output to save it, or --json-summary for structured metadata the agent can parse.
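For the `--json-summary` path, the agent can shell out and parse stdout. A minimal sketch; the field names used here (`words`, `output`) are assumptions inferred from the docs' "file path, duration, word count", not confirmed script output:

```python
import json
import subprocess

# Run the skill script and capture the JSON summary from stdout.
# With --output, the transcript goes to a file, so stdout should
# contain only the summary JSON.
result = subprocess.run(
    ["uv", "run", "scripts/transcribe.py", "clip.wav",
     "--output", "clip.txt", "--json-summary"],
    capture_output=True, text=True, check=True,
)
summary = json.loads(result.stdout)

# "words" and "output" are assumed keys; verify against real output.
print(f"Saved {summary.get('words', '?')} words to {summary.get('output', '?')}")
```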
Common Commands
```bash
# Transcribe an audio file (prints plain text)
uv run "${CLAUDE_SKILL_DIR}/scripts/transcribe.py" ~/Downloads/meeting.m4a

# Save as SRT subtitles
uv run "${CLAUDE_SKILL_DIR}/scripts/transcribe.py" video.mp4 --format srt --output video.srt

# Fast preview with a smaller model
uv run "${CLAUDE_SKILL_DIR}/scripts/transcribe.py" interview.mp3 --model small

# Transcribe a non-English file with explicit language
uv run "${CLAUDE_SKILL_DIR}/scripts/transcribe.py" hindi.m4a --language hi

# Translate any language to English
uv run "${CLAUDE_SKILL_DIR}/scripts/transcribe.py" french.mp3 --translate

# JSON with per-segment timestamps (for further processing)
uv run "${CLAUDE_SKILL_DIR}/scripts/transcribe.py" lecture.mp3 --format json --output lecture.json

# JSON summary to stdout (file path, duration, word count)
uv run "${CLAUDE_SKILL_DIR}/scripts/transcribe.py" clip.wav --output clip.txt --json-summary
```
Presenting Results
Follow the shared formatting rules in ../../references/platform-formatting.md when showing the transcript in chat platforms (no markdown tables, end with a follow-up prompt).
Short transcripts (< ~1000 words)
Show the transcript inline:
Here's the transcript of meeting.m4a (3m 24s, English, large-v3-turbo):
[transcript body]
Want me to save it as SRT/VTT subtitles, or translate to another language?
Long transcripts
Save to file and show a preview:
Transcribed lecture.mp3 (47m, English). Saved full transcript to lecture.txt.
Preview:
[first ~200 words]
Want me to summarize it, extract key points, or generate subtitles?
With timestamps (SRT / VTT)
Save to file and confirm:
Generated SRT subtitles for video.mp4 (12 segments). Saved to video.srt.
Handling Edge Cases
- First run (no models): The default model (`large-v3-turbo`, ~800MB) auto-downloads. Tell the user "Downloading the Whisper model (~800MB), this is a one-time setup." Subsequent runs start in seconds.
- ffmpeg missing on Apple Silicon: The script auto-falls back to faster-whisper (CPU). Mention this and suggest `brew install ffmpeg` for faster transcription next time.
- Very large files (> 1 hour): Warn the user transcription may take several minutes. `large-v3-turbo` on MLX processes ~10 min of audio in under a minute on M-series; on CPU expect roughly real-time.
- Unsupported format: If ffmpeg is present, most formats work (mp3, m4a, wav, mp4, mov, flac, ogg, webm, opus). Without ffmpeg, faster-whisper's PyAV backend still handles most common formats.
- Very short clips (< 5 seconds): Whisper may hallucinate. Suggest `--model small` for short snippets — it's less prone to this.
- Accuracy-critical work: Suggest `--model large-v3` (non-turbo) for best accuracy. Turbo trades a small accuracy drop for ~5x speed.
- Non-English audio: Auto-detect is reliable; passing `--language` skips detection and is slightly faster. For mixed-language audio, let auto-detect run.
- Noisy or low-quality audio: Suggest `--model large-v3` for noisy recordings; `turbo` and smaller models degrade faster on noisy inputs.
Model Reference
| Model | Size | Speed (M-series, MLX) | Accuracy | When to use |
|---|---|---|---|---|
| `tiny` | ~75MB | Fastest | Lowest | Quick drafts, clear speech |
| `base` | ~140MB | Very fast | Low-ish | Simple English audio |
| `small` | ~465MB | Fast | Good | Everyday use, short clips |
| `medium` | ~1.5GB | Moderate | Better | Quality bump over small |
| `large-v3-turbo` | ~800MB | Fast (default) | Near large-v3 | Default — best balance |
| `large-v3` | ~3GB | Slower | Best | Accuracy-critical, noisy audio |
Turbo has fewer decoder layers than large-v3 but retains the same encoder — it's specifically designed as the accuracy-for-speed tradeoff point and is the right default for almost all use cases.