Transcribe Skill
Production-grade speech-to-text transcription with intelligent file handling, multiple output formats, and parallel processing.
When to Use
✅ USE this skill when:
- Transcribing audio recordings to text
- Creating subtitles for video content
- Converting speech to searchable text
- Needing word-level timestamps
- Processing podcasts or meeting recordings
- Transcribing interviews
- Converting audio notes to text
- Creating transcripts for video editing
❌ DON'T use this skill when:
- Transcribing YouTube videos → Use youtube-transcript (faster, no API cost)
- Real-time transcription → Use streaming tools
- Already have captions → Use youtube-transcript
- Need video-specific processing → Use ffmpeg-tools first
Prerequisites
# 1. Get Groq API key
# Visit: https://console.groq.com/
# Create an API key
# 2. Set environment variable
export GROQ_API_KEY="gsk_your_api_key_here"
# 3. Install FFmpeg (for audio processing)
brew install ffmpeg # macOS
sudo apt install ffmpeg # Ubuntu/Debian
# 4. Verify
node --version   # Node.js runtime for transcribe.js
ffmpeg -version  # FFmpeg for audio processing
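The checks above can be run in one pass; a minimal sketch (`check_tool` is a hypothetical helper, not part of the skill):

```shell
# Hypothetical preflight helper: prints ok/missing for each dependency.
check_tool() {
  command -v "$1" >/dev/null 2>&1 && echo "ok: $1" || echo "missing: $1"
}

check_tool node
check_tool ffmpeg
# The API key is an environment variable, not a binary, so test it directly.
[ -n "$GROQ_API_KEY" ] && echo "ok: GROQ_API_KEY" || echo "missing: GROQ_API_KEY"
```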
Commands
Basic Usage
# Basic transcription (outputs plain text)
{baseDir}/transcribe.js audio.m4a
# Transcribe with specific output format
{baseDir}/transcribe.js audio.mp3 --format srt --output subtitles.srt
{baseDir}/transcribe.js meeting.wav --format json --output result.json
# Specify language for better accuracy
{baseDir}/transcribe.js spanish.mp3 --language es --format text
{baseDir}/transcribe.js audio.mp3 --language de --format vtt
Output Formats
# Plain text (default)
{baseDir}/transcribe.js audio.mp3 --format text
Outputs the plain transcript with no timestamps.
# JSON with detailed data
{baseDir}/transcribe.js audio.mp3 --format json
{
"text": "Transcription text...",
"duration": 123.45,
"language": "en",
"words": [{"word": "Transcription", "start": 0.0, "end": 0.5}, ...]
}
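Because the JSON output is machine-readable, individual fields can be pulled out with jq; a sketch using a sample payload shaped like the output above (not real transcriber output):

```shell
# Sample payload shaped like the JSON output above.
json='{"text":"Transcription text...","duration":123.45,"language":"en","words":[{"word":"Transcription","start":0.0,"end":0.5}]}'

# Extract individual fields with jq.
printf '%s' "$json" | jq -r '.text'        # the plain transcript
printf '%s' "$json" | jq -r '.duration'    # 123.45
printf '%s' "$json" | jq '.words | length' # word count: 1
```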
# SRT subtitles
{baseDir}/transcribe.js audio.mp3 --format srt --output subtitles.srt
1
00:00:00,000 --> 00:00:05,500
Transcription of the audio begins here
2
00:00:05,500 --> 00:00:11,200
And continues in the next segment
# VTT subtitles
{baseDir}/transcribe.js audio.mp3 --format vtt --output captions.vtt
WEBVTT
00:00.000 --> 00:05.500
Transcription of the audio begins here
# Word timings TSV
{baseDir}/transcribe.js audio.mp3 --format tsv
start\tend\tword
0.000\t0.450\tTranscription
0.450\t0.820\tof
0.820\t1.240\tthe
# Word timings CSV
{baseDir}/transcribe.js audio.mp3 --format csv
start,end,word
0.000,0.450,"Transcription"
0.450,0.820,"of"
0.820,1.240,"the"
Format Comparison:
| Format | Use Case | Word Timestamps | File Size |
|---|---|---|---|
| text | General use | ❌ | Small |
| json | API integration | ✅ | Large |
| srt | Subtitles | ⚠️ Phrases | Medium |
| vtt | Web captions | ⚠️ Phrases | Medium |
| tsv | Spreadsheet | ✅ | Medium |
| csv | Database import | ✅ | Medium |
| word_timings | Analysis | ✅ | Large |
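The word-timing formats can also be derived after the fact from the JSON output; a sketch with jq (sample data, not real transcriber output):

```shell
# Sample JSON words array, shaped like the JSON output format.
json='{"words":[{"word":"Transcription","start":0.0,"end":0.45},{"word":"of","start":0.45,"end":0.82}]}'

# Emit the CSV header, then one row per word; @csv quotes strings automatically.
printf '%s' "$json" | jq -r '"start,end,word", (.words[] | [.start, .end, .word] | @csv)'
```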
Language Selection
# Auto-detect (default)
{baseDir}/transcribe.js audio.mp3
# Specify language for better accuracy
{baseDir}/transcribe.js audio.mp3 --language en # English
{baseDir}/transcribe.js audio.mp3 --language es # Spanish
{baseDir}/transcribe.js audio.mp3 --language fr # French
{baseDir}/transcribe.js audio.mp3 --language de # German
{baseDir}/transcribe.js audio.mp3 --language ja # Japanese
Supported Languages: All 99 languages supported by Whisper
Large File Processing
# Files >25MB are automatically segmented
{baseDir}/transcribe.js long-recording.mp3
# Progress shown for segmented files
⏳ Transcribing: Segment 3/12 (25.0%) | Elapsed: 45.2s
# Output combined automatically
Cache Control
# Use cache (default) - instant for previously transcribed
{baseDir}/transcribe.js audio.mp3
# Force fresh transcription
{baseDir}/transcribe.js audio.mp3 --no-cache
API Provider Selection
# Use Groq (default) - faster, cheaper
{baseDir}/transcribe.js audio.mp3 --provider groq
# Use OpenAI Whisper (requires OPENAI_API_KEY)
{baseDir}/transcribe.js audio.mp3 --provider openai
Supported Audio Formats
| Format | Extension | Notes |
|---|---|---|
| MP3 | .mp3 | Best compatibility |
| MP4 | .mp4, .m4a | iOS recordings |
| WAV | .wav | Uncompressed, large files |
| OGG | .ogg, .oga, .ogv | Open format |
| FLAC | .flac | Lossless compression |
| WebM | .webm | Web audio/videos |
| AAC | .aac | Apple format |
| WMA | .wma | Windows format |
Audio Preprocessing:
- Unsupported formats are auto-converted to MP3
- Sample rate normalized to 16kHz (Whisper optimal)
- Mono channel for better accuracy
- Bitrate: 192kbps MP3
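The normalization above corresponds to a single set of ffmpeg flags; a sketch wrapping them in a helper so the argument list is reusable (the helper name is illustrative, values come from the bullets above):

```shell
# Build the ffmpeg flags for Whisper-friendly audio: 16 kHz, mono, 192 kbps MP3.
whisper_audio_args() {
  printf '%s' "-ar 16000 -ac 1 -c:a libmp3lame -b:a 192k"
}

# Example invocation (shown, not run):
echo "ffmpeg -i input.m4a $(whisper_audio_args) output.mp3"
```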
Features
Automatic Segmentation
Large audio files are automatically split for processing:
Audio File >25MB
↓ FFmpeg
Convert to MP3 (16kHz, mono)
↓
Split into 10-minute segments
↓
Transcribe segments in parallel
↓
Merge results with adjusted timestamps
Segmentation Benefits:
- ✓ Handles recordings up to 2 hours
- ✓ Respects API rate limits
- ✓ Parallel processing for speed
- ✓ Seamless results (timestamps adjusted)
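The split step above is ceiling division over the 10-minute segment size; a sketch in shell arithmetic (600 s per segment, so a partial segment still counts):

```shell
# Number of 10-minute segments needed for a recording of $1 seconds.
segments_for() {
  echo $(( ($1 + 599) / 600 ))
}

segments_for 1800   # 30-minute file → 3 segments
segments_for 601    # just over 10 minutes → 2 segments
```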
Word-Level Timestamps
Each word includes start and end timestamps:
{
"words": [
{"word": "Hello", "start": 0.000, "end": 0.320},
{"word": "and", "start": 0.320, "end": 0.560},
{"word": "welcome", "start": 0.560, "end": 0.980},
{"word": "everyone", "start": 0.980, "end": 1.420}
]
}
Uses for Timestamps:
- Jump to specific words in audio
- Create perfectly synced subtitles
- Search within transcripts
- Edit audio at transcript points
- Analyze speech patterns
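Searching within a transcript is a one-line jq filter over the words array; a sketch using the sample payload above:

```shell
# Sample words array, shaped like the word-timestamp output above.
words='{"words":[{"word":"Hello","start":0,"end":0.32},{"word":"welcome","start":0.56,"end":0.98}]}'

# Find where a word is spoken; prints its start time in seconds.
printf '%s' "$words" | jq -r '.words[] | select(.word == "welcome") | .start'
```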
Intelligent Caching
- Cache location: /tmp/transcribe-cache/
- TTL: 24 hours
- Cache key: file hash + language + model + output format
# First time: ~10-60 seconds
{baseDir}/transcribe.js audio.mp3 --format json
# Second time: ~1 second (cache hit)
{baseDir}/transcribe.js audio.mp3 --format json
# Force fresh: ~10-60 seconds
{baseDir}/transcribe.js audio.mp3 --format json --no-cache
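A hypothetical sketch of the cache-key scheme (file hash combined with language and model; `cksum` stands in for whatever hash the tool actually uses, and the real key also folds in the output format per the notes):

```shell
# Hypothetical: derive a cache key from file contents + language + model.
cache_key() {
  # $1 = audio file, $2 = language, $3 = model
  { cat "$1"; printf '%s:%s' "$2" "$3"; } | cksum | cut -d' ' -f1
}

# Same inputs always produce the same key (cache hit);
# changing the language or model produces a different key (cache miss).
```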
Rate Limiting
Built-in protection against API limits:
- Max 60 requests per minute
- Automatic delays between requests
- Sequential processing for safety
Cost Optimization:
- Groq Whisper Turbo: Free tier available
- Cached results cost nothing
- Segmented files use 1 request per segment
Error Handling
Error Codes
| Code | Name | Description |
|---|---|---|
| 0 | SUCCESS | Transcription complete |
| 1 | INVALID_INPUT | Bad parameters |
| 2 | FILE_NOT_FOUND | Audio file missing |
| 3 | FILE_TOO_LARGE | Exceeds 2 hours |
| 4 | UNSUPPORTED_FORMAT | Can't process format |
| 5 | API_KEY_MISSING | GROQ_API_KEY not set |
| 6 | API_ERROR | Request failed |
| 7 | RATE_LIMITED | API throttling |
| 8 | NETWORK_ERROR | Connection issue |
| 9 | TIMEOUT | Request took too long |
| 10 | AUDIO_PROCESSING_ERROR | FFmpeg failed |
| 11 | SEGMENTATION_ERROR | Splitting failed |
| 12 | INTERRUPTED | User cancelled |
| 99 | UNKNOWN | Unexpected error |
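A batch script can branch on these codes; a minimal sketch (the messages are illustrative, the codes come from the table above):

```shell
# Map transcribe.js exit codes to next actions.
handle_exit() {
  case "$1" in
    0) echo "success" ;;
    5) echo "set GROQ_API_KEY and retry" ;;
    7) echo "rate limited: wait 60s and retry" ;;
    *) echo "failed with code $1" ;;
  esac
}

# Typical use:
# {baseDir}/transcribe.js audio.mp3; handle_exit $?
```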
Common Errors
"API key not found"
# Solution: Set the environment variable
export GROQ_API_KEY="gsk_your_key"
echo "export GROQ_API_KEY=gsk_your_key" >> ~/.zshrc # Persist
"File too large"
# Audio/video duration exceeds 2 hours
# Solution: split into halves first (-c copy avoids re-encoding)
ffmpeg -i long.mp4 -ss 0 -t 7200 -c copy first.mp4
ffmpeg -i long.mp4 -ss 7200 -t 7200 -c copy second.mp4
"Rate limited"
# Too many requests
# Solution: Wait 1 minute, try again
# Or add delay between batch operations
Technical Details
Processing Pipeline
1. Validate Input
├── Check file exists
├── Check format supported
├── Probe audio metadata
└── Validate size/duration
2. Check Cache
└── Return cached if available
3. Preprocess (if needed)
├── Convert to MP3
├── Set sample rate to 16kHz
└── Normalize to mono
4. Split (if >25MB)
└── Create 10-minute segments
5. Transcribe
├── Rate-limited requests
├── Word-level timestamps
└── Progress tracking
6. Merge (if segmented)
└── Adjust timestamps
7. Format Output
└── Apply selected format
8. Cache Result
└── Store for 24 hours
API Configuration
Groq (Default):
- Endpoint: api.groq.com/v1/audio/transcriptions
- Model: whisper-large-v3-turbo
- Max file size: 25MB per request
- Word-level timestamps: Yes
- Cost: $0.0013/minute (free tier available)
OpenAI (Optional):
- Endpoint: api.openai.com/v1/audio/transcriptions
- Model: whisper-1
- Max file size: 25MB per request
- Word-level timestamps: Yes
- Cost: $0.006/minute
Timestamp Adjustment
For segmented files, timestamps are adjusted:
Segment 1: [0:00 - 10:00] → [0:00 - 10:00]
Segment 2: [0:00 - 10:00] → [10:00 - 20:00]
Segment 3: [0:00 - 10:00] → [20:00 - 30:00]
Example:
Segment 2 word: "discussion", start: 5:30
Adjusted timestamp: 5:30 + 10:00 = 15:30
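In whole seconds, the adjustment above is just an offset of segment index × 600; a sketch:

```shell
# $1 = 0-based segment index, $2 = local timestamp in whole seconds.
adjust_timestamp() {
  echo $(( $1 * 600 + $2 ))
}

adjust_timestamp 1 330   # 5:30 into segment 2 → 930 s (15:30)
```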
Examples
Transcribe Meeting Recording
#!/bin/bash
MEETING="meeting-$(date +%Y%m%d).mp3"
echo "Transcribing meeting..."
{baseDir}/transcribe.js "$MEETING" --format text --output "$MEETING.txt"
{baseDir}/transcribe.js "$MEETING" --format srt --output "$MEETING.srt"
{baseDir}/transcribe.js "$MEETING" --format json --output "$MEETING.json"
echo "Done: $MEETING.{txt,srt,json}"
Batch Transcribe Directory
#!/bin/bash
mkdir -p transcripts
for audio in *.mp3 *.m4a *.wav; do
[ -f "$audio" ] || continue
echo "Processing: $audio"
base="${audio%.*}"
{baseDir}/transcribe.js "$audio" --format srt --output "transcripts/${base}.srt" 2>/dev/null
if [ $? -eq 0 ]; then
echo " ✓ Created transcripts/${base}.srt"
else
echo " ✗ Failed"
fi
sleep 1 # Rate limit protection
done
Create Searchable Meeting Archive
#!/bin/bash
INPUT="meeting.mp3"
# Transcribe with word timings
{baseDir}/transcribe.js "$INPUT" --format json --output meeting.json
# Extract all utterances with timestamps
jq -r '
  .words[] |
  "\(.start * 100 | floor / 100)\t\(.word)"
' meeting.json > meeting-by-words.txt
# Create time-indexed file
echo "Meeting transcript indexed by time" > index.txt
while IFS=$'\t' read -r time word; do
echo "$time: $word" >> index.txt
done < meeting-by-words.txt
echo "Archive created: index.txt"
Subtitle Synchronization
#!/bin/bash
VIDEO="video.mp4"
AUDIO="video.m4a" # Extracted audio
# Get word-level transcription
{baseDir}/transcribe.js "$AUDIO" --format json --output transcription.json
# Create SRT subtitles, grouping words into 8-word cues
jq -r '
  def pad(n): tostring | if length >= n then . else "000"[0:(n - length)] + . end;
  def srt_time: . as $t
    | "\($t / 3600 | floor | pad(2)):\($t % 3600 / 60 | floor | pad(2)):\($t % 60 | floor | pad(2)),\($t * 1000 | floor | . % 1000 | pad(3))";
  def chunks(n): range(0; length; n) as $i | .[$i:$i + n];
  [.words | chunks(8)]
  | to_entries[]
  | "\(.key + 1)\n\(.value[0].start | srt_time) --> \(.value[-1].end | srt_time)\n\(.value | map(.word) | join(" "))\n"
' transcription.json > subtitles.srt
echo "SRT subtitles created: subtitles.srt"
Extract Keywords with Timestamps
#!/bin/bash
AUDIO="recording.mp3"
KEYWORDS=("budget" "timeline" "decision")
# Transcribe
{baseDir}/transcribe.js "$AUDIO" --format json --output data.json
# Find keywords with timestamps
echo "Keyword timestamps:"
for kw in "${KEYWORDS[@]}"; do
jq -r --arg kw "$kw" '.words[] | select(.word | ascii_downcase | contains($kw | ascii_downcase)) | "\(.word) at \(.start)s"' data.json
done
Performance Tips
1. Use Cache
# First time (slow)
{baseDir}/transcribe.js audio.mp3
# Second time (fast)
{baseDir}/transcribe.js audio.mp3
# Same file, different format - different cache
{baseDir}/transcribe.js audio.mp3 --format srt # New cache entry
2. Specify Language
# Auto-detect (slower first pass)
{baseDir}/transcribe.js spanish.mp3
# Specify language (faster, more accurate)
{baseDir}/transcribe.js spanish.mp3 --language es
3. Pre-extract Audio
# Slower: video with embedded audio
{baseDir}/transcribe.js video.mp4
# Faster: pre-extracted audio
ffmpeg -i video.mp4 -vn -c:a libmp3lame -b:a 192k audio.mp3
{baseDir}/transcribe.js audio.mp3
4. Batch Processing
# Process multiple files
for f in *.mp3; do
{baseDir}/transcribe.js "$f" &
done
wait
5. Parallel Segments
# Large files process segments in parallel
# 30-minute file with 3 segments
# Elapsed time: ~60 seconds (3x faster than sequential)
Notes
- Maximum file duration: 2 hours
- Maximum file size for direct upload: 25MB
- Caching includes format in key (different formats = different caches)
- API rate limits: 60 requests/minute
- Segment size: 10 minutes (configurable in code)
- Output format affects cache (srt and json cached separately)
- Word timestamps provide ~50ms precision
- SRT/VTT formats group words into phrases (~5 words)
- TSV/CSV provide per-word timestamps
- JSON includes all metadata and word-level data
- Audio preprocessing preserves quality while optimizing for Whisper
- FFmpeg required for format conversion and segmentation
- Network errors retry up to 3 times with exponential backoff