# YouTube Video Transcription
Transcribe YouTube videos to text using OpenAI Whisper and yt-dlp.
## Overview
This skill downloads audio from YouTube videos using yt-dlp and transcribes it with OpenAI's Whisper model. It supports multiple output formats (txt, srt, vtt, tsv, json) and a range of model sizes for different accuracy/speed tradeoffs.
## Instructions
### 1. Install dependencies
```bash
# Install whisper and yt-dlp
pip install openai-whisper yt-dlp

# Verify ffmpeg is installed (required for audio processing)
ffmpeg -version
```
If ffmpeg is missing:

- macOS: `brew install ffmpeg`
- Ubuntu/Debian: `sudo apt install ffmpeg`
- Windows: download from https://ffmpeg.org/download.html
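To fail fast before a long download, a small check like this can verify the tools installed above are on PATH (a sketch; `missing_tools` is a helper name of my own, not part of either package):

```python
import shutil

def missing_tools(tools=("ffmpeg", "yt-dlp", "whisper")):
    """Return the subset of command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Example: abort early if anything is missing
# if missing_tools():
#     raise SystemExit(f"Missing dependencies: {', '.join(missing_tools())}")
```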
### 2. Download audio from YouTube
```bash
# Download best audio quality as WAV
yt-dlp -x --audio-format wav -o "%(title)s.%(ext)s" "YOUTUBE_URL"

# Download as MP3 (smaller file)
yt-dlp -x --audio-format mp3 -o "%(title)s.%(ext)s" "YOUTUBE_URL"

# Download with video ID as filename (safer for special characters)
yt-dlp -x --audio-format wav -o "%(id)s.%(ext)s" "YOUTUBE_URL"
```
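The same download can be driven from Python through yt-dlp's embedding API. This is a sketch assuming the `yt_dlp` package from the install step; the option keys mirror the `-x --audio-format` flags above:

```python
def audio_download_opts(codec="wav", template="%(id)s.%(ext)s"):
    """Build YoutubeDL options that extract audio only, mirroring -x --audio-format."""
    return {
        "format": "bestaudio/best",
        "outtmpl": template,
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": codec,
        }],
    }

def download_audio(url, codec="wav"):
    import yt_dlp  # deferred so the options helper works without the package installed
    with yt_dlp.YoutubeDL(audio_download_opts(codec)) as ydl:
        ydl.download([url])
```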
### 3. Choose a Whisper model
| Model | Parameters | VRAM | Relative Speed | Use Case |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~32x | Quick drafts, testing |
| base | 74M | ~1 GB | ~16x | Fast transcription |
| small | 244M | ~2 GB | ~6x | Good balance |
| medium | 769M | ~5 GB | ~2x | High accuracy |
| large | 1550M | ~10 GB | 1x | Best accuracy |
English-only models (tiny.en, base.en, small.en, medium.en) are faster for English content.
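One way to apply the table is a small heuristic helper. The thresholds below are illustrative assumptions drawn from the guidelines in this document, not official Whisper recommendations:

```python
def pick_model(duration_minutes, english_only=False, high_accuracy=False):
    """Pick a Whisper model name from rough duration/accuracy needs (illustrative thresholds)."""
    if high_accuracy:
        name = "large" if duration_minutes <= 60 else "medium"
    elif duration_minutes > 60:
        name = "small"  # long videos: favor speed
    else:
        name = "medium"
    # English-only variants exist for every size except large
    if english_only and name != "large":
        name += ".en"
    return name
```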
### 4. Run transcription
CLI approach:
```bash
# Basic transcription (auto-detect language)
whisper audio.wav --model medium

# Specify language for better accuracy
whisper audio.wav --model medium --language en

# Output specific format
whisper audio.wav --model medium --output_format srt

# All formats at once
whisper audio.wav --model medium --output_format all

# Specify output directory
whisper audio.wav --model medium --output_dir ./transcripts
```
Python approach:
```python
import whisper

# Load model (downloads on first run)
model = whisper.load_model("medium")

# Transcribe
result = model.transcribe("audio.wav", language="en")

# Get plain text
print(result["text"])

# Get segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
```
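If you need SRT from the Python API without rerunning the CLI, the `segments` list above can be serialized directly. This converter is a sketch of the standard SRT layout (index, `HH:MM:SS,mmm` timestamp range, text, blank line):

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Serialize Whisper-style segment dicts (start, end, text) as an SRT document."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line separates cues
    return "\n".join(lines)
```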
### 5. One-liner pipeline
Combine download and transcription:
```bash
# Download and transcribe in one command
yt-dlp -x --audio-format wav -o "audio.wav" "YOUTUBE_URL" && whisper audio.wav --model medium --output_format all
```
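The same two-step pipeline can be scripted with `subprocess`, which makes it easier to parameterize the model and stop cleanly if either step fails. The helper below just assembles the commands shown above; the function names are mine, not part of either tool:

```python
import subprocess

def pipeline_commands(url, model="medium", audio_file="audio.wav"):
    """Return the yt-dlp and whisper invocations for one download-and-transcribe run."""
    download = ["yt-dlp", "-x", "--audio-format", "wav", "-o", audio_file, url]
    transcribe = ["whisper", audio_file, "--model", model, "--output_format", "all"]
    return download, transcribe

def run_pipeline(url, model="medium"):
    for cmd in pipeline_commands(url, model):
        subprocess.run(cmd, check=True)  # check=True raises if a step exits nonzero
```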
### 6. Alternative: yt-whisper tool
For a simpler workflow, use the dedicated yt-whisper package:
```bash
# Install
pip install git+https://github.com/m1guelpf/yt-whisper.git

# Transcribe directly from URL
yt_whisper "https://www.youtube.com/watch?v=VIDEO_ID"

# With options
yt_whisper "YOUTUBE_URL" --model medium --language en --output_format srt
```
## Output Formats
| Format | Extension | Description |
|---|---|---|
| txt | .txt | Plain text transcript |
| srt | .srt | SubRip subtitle format (with timestamps) |
| vtt | .vtt | WebVTT subtitle format |
| tsv | .tsv | Tab-separated values |
| json | .json | Full data with word-level timestamps |
## Guidelines
- Use the `--language` flag when you know the spoken language for significantly better accuracy
- For long videos (>1 hour), use the `small` or `medium` model to balance speed and accuracy
- English-only models (`.en` suffix) are faster and more accurate for English content
- A GPU with CUDA dramatically speeds up transcription; CPU works but is 5-10x slower
- If transcription fails, ensure ffmpeg is properly installed and in PATH
- For videos with background music, larger models (medium/large) handle it better
- Clean up audio files after transcription to save disk space
- Use `--output_format all` to get every format at once, then choose what you need
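The cleanup guideline can be automated once the transcripts are written. This sketch deletes downloaded audio files matching common extensions; the patterns are assumptions about what step 2 produced, so adjust them to your actual download format:

```python
from pathlib import Path

def remove_audio_files(directory=".", patterns=("*.wav", "*.mp3", "*.m4a")):
    """Delete downloaded audio files in a directory; return how many were removed."""
    removed = 0
    for pattern in patterns:
        for path in Path(directory).glob(pattern):
            path.unlink()
            removed += 1
    return removed
```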