Transcribe Skill

Production-grade speech-to-text transcription with intelligent file handling, multiple output formats, and parallel processing.

When to Use

USE this skill when:

  • Transcribing audio recordings to text
  • Creating subtitles for video content
  • Converting speech to searchable text
  • Needing word-level timestamps
  • Processing podcasts or meeting recordings
  • Transcribing interviews
  • Converting audio notes to text
  • Creating transcripts for video editing

DON'T use this skill when:

  • Transcribing YouTube videos → Use youtube-transcript (faster, no API cost)
  • Real-time transcription → Use streaming tools
  • Already have captions → Use youtube-transcript
  • Need video-specific processing → Use ffmpeg-tools first

Prerequisites

# 1. Get Groq API key
# Visit: https://console.groq.com/
# Create an API key

# 2. Set environment variable
export GROQ_API_KEY="gsk_your_api_key_here"

# 3. Install FFmpeg (for audio processing)
brew install ffmpeg        # macOS
sudo apt install ffmpeg    # Ubuntu/Debian

# 4. Verify
node --version     # Should print the Node.js version
ffmpeg -version    # Should print the FFmpeg version

Commands

Basic Usage

# Basic transcription (outputs plain text)
{baseDir}/transcribe.js audio.m4a

# Transcribe with specific output format
{baseDir}/transcribe.js audio.mp3 --format srt --output subtitles.srt
{baseDir}/transcribe.js meeting.wav --format json --output result.json

# Specify language for better accuracy
{baseDir}/transcribe.js spanish.mp3 --language es --format text
{baseDir}/transcribe.js audio.mp3 --language de --format vtt

Output Formats

# Plain text (default)
{baseDir}/transcribe.js audio.mp3 --format text
The transcribed text follows, without timestamps.

# JSON with detailed data
{baseDir}/transcribe.js audio.mp3 --format json
{
  "text": "Transcription text...",
  "duration": 123.45,
  "language": "en",
  "words": [{"word": "Transcription", "start": 0.0, "end": 0.5}, ...]
}

# SRT subtitles
{baseDir}/transcribe.js audio.mp3 --format srt --output subtitles.srt
1
00:00:00,000 --> 00:00:05,500
Transcription of the audio begins here

2
00:00:05,500 --> 00:00:11,200
And continues in the next segment

# VTT subtitles
{baseDir}/transcribe.js audio.mp3 --format vtt --output captions.vtt
WEBVTT

00:00.000 --> 00:05.500
Transcription of the audio begins here

# Word timings TSV
{baseDir}/transcribe.js audio.mp3 --format tsv
start\tend\tword
0.000\t0.450\tTranscription
0.450\t0.820\tof
0.820\t1.240\tthe

# Word timings CSV
{baseDir}/transcribe.js audio.mp3 --format csv
start,end,word
0.000,0.450,"Transcription"
0.450,0.820,"of"
0.820,1.240,"the"

Format Comparison:

Format        Use Case         Word Timestamps  File Size
text          General use      No               Small
json          API integration  Yes              Large
srt           Subtitles        ⚠️ Phrases       Medium
vtt           Web captions     ⚠️ Phrases       Medium
tsv           Spreadsheet      Yes              Medium
csv           Database import  Yes              Medium
word_timings  Analysis         Yes              Large
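
The TSV and CSV layouts shown above can also be derived from the `words` array of the JSON output, which avoids a second run when both are needed. The sketch below is illustrative only: `formatWordTimings` is a hypothetical helper, not part of transcribe.js, and it assumes each word object carries `word`, `start`, and `end` fields as in the JSON sample.

```javascript
// Render word timings as delimited text. The column order and the
// three-decimal seconds follow the sample TSV/CSV output in this document.
function formatWordTimings(words, delimiter = "\t") {
  const quote = delimiter === ","; // the CSV sample wraps each word in double quotes
  const rows = words.map(({ word, start, end }) => {
    const text = quote ? `"${word.replace(/"/g, '""')}"` : word;
    return [start.toFixed(3), end.toFixed(3), text].join(delimiter);
  });
  return ["start", "end", "word"].join(delimiter) + "\n" + rows.join("\n");
}

const sample = [
  { word: "Transcription", start: 0.0, end: 0.45 },
  { word: "of", start: 0.45, end: 0.82 },
];
console.log(formatWordTimings(sample, ","));
```

Calling it without a delimiter yields the tab-separated variant.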

Language Selection

# Auto-detect (default)
{baseDir}/transcribe.js audio.mp3

# Specify language for better accuracy
{baseDir}/transcribe.js audio.mp3 --language en   # English
{baseDir}/transcribe.js audio.mp3 --language es   # Spanish
{baseDir}/transcribe.js audio.mp3 --language fr   # French
{baseDir}/transcribe.js audio.mp3 --language de   # German
{baseDir}/transcribe.js audio.mp3 --language ja   # Japanese

Supported Languages: All 99 languages supported by Whisper

Large File Processing

# Files >25MB are automatically segmented
{baseDir}/transcribe.js long-recording.mp3

# Progress shown for segmented files
⏳ Transcribing: Segment 3/12 (25.0%) | Elapsed: 45.2s

# Output combined automatically

Cache Control

# Use cache (default) - instant for previously transcribed
{baseDir}/transcribe.js audio.mp3

# Force fresh transcription
{baseDir}/transcribe.js audio.mp3 --no-cache

API Provider Selection

# Use Groq (default) - faster, cheaper
{baseDir}/transcribe.js audio.mp3 --provider groq

# Use OpenAI Whisper (requires OPENAI_API_KEY)
{baseDir}/transcribe.js audio.mp3 --provider openai

Supported Audio Formats

Format  Extension         Notes
MP3     .mp3              Best compatibility
MP4     .mp4, .m4a        iOS recordings
WAV     .wav              Uncompressed, large files
OGG     .ogg, .oga, .ogv  Open format
FLAC    .flac             Lossless compression
WebM    .webm             Web audio/video
AAC     .aac              Apple format
WMA     .wma              Windows format

Audio Preprocessing:

  • Unsupported formats are auto-converted to MP3
  • Sample rate normalized to 16kHz (Whisper optimal)
  • Mono channel for better accuracy
  • Bitrate: 192kbps MP3

Features

Automatic Segmentation

Large audio files are automatically split for processing:

Audio File >25MB
    ↓ FFmpeg
Convert to MP3 (16kHz, mono)
Split into 10-minute segments
Transcribe segments in parallel
Merge results with adjusted timestamps

Segmentation Benefits:

  • ✓ Handles recordings up to 2 hours
  • ✓ Respects API rate limits
  • ✓ Parallel processing for speed
  • ✓ Seamless results (timestamps adjusted)
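
To make the splitting step concrete, here is a minimal sketch of how a segment plan could be computed from a recording's duration. `planSegments` is a hypothetical name and the skill's real splitting logic is not shown in this document; only the 10-minute segment length comes from the text above.

```javascript
// Plan fixed-length segments for a recording.
// segmentSeconds defaults to the documented 10-minute segment size.
function planSegments(durationSeconds, segmentSeconds = 600) {
  const count = Math.ceil(durationSeconds / segmentSeconds);
  return Array.from({ length: count }, (_, index) => {
    const offset = index * segmentSeconds;
    return {
      index,
      offset, // seconds to add back to this segment's timestamps after transcription
      duration: Math.min(segmentSeconds, durationSeconds - offset),
    };
  });
}

// A 25-minute recording yields two full segments and one 5-minute remainder.
console.log(planSegments(1500));
```

Each entry's `offset` is the value later added to that segment's timestamps when results are merged.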

Word-Level Timestamps

Each word includes start and end timestamps:

{
  "words": [
    {"word": "Hello", "start": 0.000, "end": 0.320},
    {"word": "and", "start": 0.320, "end": 0.560},
    {"word": "welcome", "start": 0.560, "end": 0.980},
    {"word": "everyone", "start": 0.980, "end": 1.420}
  ]
}

Uses for Timestamps:

  • Jump to specific words in audio
  • Create perfectly synced subtitles
  • Search within transcripts
  • Edit audio at transcript points
  • Analyze speech patterns
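
As an illustration of the subtitle use case, the following sketch groups word timings into SRT cues. The helper names (`srtTime`, `wordsToSrt`) and the default of five words per cue are assumptions for this example; the skill's own SRT writer may group words differently.

```javascript
// Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
function srtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const pad = (value, width) => String(value).padStart(width, "0");
  const h = Math.floor(totalMs / 3_600_000);
  const m = Math.floor((totalMs % 3_600_000) / 60_000);
  const s = Math.floor((totalMs % 60_000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(totalMs % 1000, 3)}`;
}

// Group words into cues of `wordsPerCue` words and render them as SRT.
function wordsToSrt(words, wordsPerCue = 5) {
  const cues = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    cues.push(
      `${cues.length + 1}\n` +
        `${srtTime(group[0].start)} --> ${srtTime(group[group.length - 1].end)}\n` +
        group.map((w) => w.word).join(" ")
    );
  }
  return cues.join("\n\n") + "\n";
}

const words = [
  { word: "Hello", start: 0.0, end: 0.32 },
  { word: "and", start: 0.32, end: 0.56 },
  { word: "welcome", start: 0.56, end: 0.98 },
  { word: "everyone", start: 0.98, end: 1.42 },
];
console.log(wordsToSrt(words, 2));
```

Each cue starts at its first word's `start` and ends at its last word's `end`, so cue boundaries always coincide with word boundaries.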

Intelligent Caching

  • Cache Location: /tmp/transcribe-cache/
  • TTL: 24 hours
  • Cache Key: File hash + language + model + output format

# First time: ~10-60 seconds
{baseDir}/transcribe.js audio.mp3 --format json

# Second time: ~1 second (cache hit)
{baseDir}/transcribe.js audio.mp3 --format json

# Force fresh: ~10-60 seconds
{baseDir}/transcribe.js audio.mp3 --format json --no-cache
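
A cache key of this kind could be derived roughly as follows. This is a sketch under stated assumptions: SHA-256, the `|` separator, and the names `cacheKey` and `isFresh` are illustrative choices, and the output format is included because the Notes say different formats are cached separately.

```javascript
const crypto = require("node:crypto");

// Derive a cache key from the file contents plus the options that change the result.
// The hash algorithm and field order are assumptions for illustration.
function cacheKey(fileBuffer, { language = "auto", model, format }) {
  const fileHash = crypto.createHash("sha256").update(fileBuffer).digest("hex");
  return crypto
    .createHash("sha256")
    .update([fileHash, language, model, format].join("|"))
    .digest("hex");
}

// An entry is fresh while it is younger than the documented 24-hour TTL.
function isFresh(cachedAtMs, nowMs = Date.now(), ttlMs = 24 * 60 * 60 * 1000) {
  return nowMs - cachedAtMs < ttlMs;
}

const audio = Buffer.from("example audio bytes");
const options = { language: "en", model: "whisper-large-v3-turbo" };
console.log(cacheKey(audio, { ...options, format: "json" }));
console.log(cacheKey(audio, { ...options, format: "srt" })); // different key
```

Hashing the file contents rather than the path means a renamed copy of the same recording still hits the cache.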

Rate Limiting

Built-in protection against API limits:

  • Max 60 requests per minute
  • Automatic delays between requests
  • Throttled request dispatch for safety

Cost Optimization:

  • Groq Whisper Turbo: Free tier available
  • Cached results cost nothing
  • Segmented files use 1 request per segment
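
One way to stay under a limit of 60 requests per minute is a sliding-window limiter that tells the caller how long to wait before sending. The sketch below is an assumption about how such protection could work, not the skill's actual implementation; `createRateLimiter` and `reserve` are hypothetical names, and the clock is passed in so the behavior is easy to check.

```javascript
// Sliding-window limiter: reserve() returns how many milliseconds to wait
// before the next request may be sent (0 means "send now").
function createRateLimiter(maxRequests = 60, windowMs = 60_000) {
  const sent = []; // send times (ms) of requests still inside the window, oldest first
  return function reserve(now = Date.now()) {
    // Forget requests that have left the window.
    while (sent.length > 0 && sent[0] <= now - windowMs) sent.shift();
    if (sent.length < maxRequests) {
      sent.push(now);
      return 0;
    }
    // Window is full: the next slot opens when the oldest request expires.
    const waitMs = sent[0] + windowMs - now;
    sent.shift();
    sent.push(now + waitMs);
    return waitMs;
  };
}

// With a limit of 2 per second, the third request must wait.
const reserve = createRateLimiter(2, 1000);
console.log(reserve(0), reserve(10), reserve(20)); // 0 0 980
```

A caller would sleep for the returned number of milliseconds before issuing the request for each segment.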

Error Handling

Error Codes

Code  Name                    Description
0     SUCCESS                 Transcription complete
1     INVALID_INPUT           Bad parameters
2     FILE_NOT_FOUND          Audio file missing
3     FILE_TOO_LARGE          Exceeds 2 hours
4     UNSUPPORTED_FORMAT      Can't process format
5     API_KEY_MISSING         GROQ_API_KEY not set
6     API_ERROR               Request failed
7     RATE_LIMITED            API throttling
8     NETWORK_ERROR           Connection issue
9     TIMEOUT                 Request took too long
10    AUDIO_PROCESSING_ERROR  FFmpeg failed
11    SEGMENTATION_ERROR      Splitting failed
12    INTERRUPTED             User cancelled
99    UNKNOWN                 Unexpected error
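
A wrapper script can turn these exit codes into names and decide whether a retry makes sense. In the sketch below the code-to-name mapping is copied from the table above, while `describeExit` and the choice of which codes count as retryable are assumptions for illustration.

```javascript
// Exit codes copied from the error code table.
const EXIT_CODES = {
  0: "SUCCESS",
  1: "INVALID_INPUT",
  2: "FILE_NOT_FOUND",
  3: "FILE_TOO_LARGE",
  4: "UNSUPPORTED_FORMAT",
  5: "API_KEY_MISSING",
  6: "API_ERROR",
  7: "RATE_LIMITED",
  8: "NETWORK_ERROR",
  9: "TIMEOUT",
  10: "AUDIO_PROCESSING_ERROR",
  11: "SEGMENTATION_ERROR",
  12: "INTERRUPTED",
  99: "UNKNOWN",
};

// Assumption: throttling, network failures, and timeouts are transient.
const RETRYABLE = new Set(["RATE_LIMITED", "NETWORK_ERROR", "TIMEOUT"]);

// Translate a process exit status into a small descriptive record.
function describeExit(code) {
  const name = EXIT_CODES[code] ?? "UNKNOWN";
  return { code, name, ok: name === "SUCCESS", retryable: RETRYABLE.has(name) };
}

console.log(describeExit(7)); // { code: 7, name: 'RATE_LIMITED', ok: false, retryable: true }
```

The `status` field returned by Node's `child_process.spawnSync` can be passed straight to `describeExit`.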

Common Errors

"API key not found"

# Solution: Set the environment variable
export GROQ_API_KEY="gsk_your_key"
echo "export GROQ_API_KEY=gsk_your_key" >> ~/.zshrc  # Persist

"File too large"

# Recording duration exceeds 2 hours
# Solution: Split manually first
ffmpeg -i long.mp4 -ss 0 -t 7200 first.mp4
ffmpeg -i long.mp4 -ss 7200 -t 7200 second.mp4

"Rate limited"

# Too many requests
# Solution: Wait 1 minute, try again
# Or add delay between batch operations

Technical Details

Processing Pipeline

1. Validate Input
   ├── Check file exists
   ├── Check format supported
   ├── Probe audio metadata
   └── Validate size/duration

2. Check Cache
   └── Return cached if available

3. Preprocess (if needed)
   ├── Convert to MP3
   ├── Set sample rate to 16kHz
   └── Normalize to mono

4. Split (if >25MB)
   └── Create 10-minute segments

5. Transcribe
   ├── Rate-limited requests
   ├── Word-level timestamps
   └── Progress tracking

6. Merge (if segmented)
   └── Adjust timestamps

7. Format Output
   └── Apply selected format

8. Cache Result
   └── Store for 24 hours

API Configuration

Groq (Default):

  • Endpoint: api.groq.com/openai/v1/audio/transcriptions
  • Model: whisper-large-v3-turbo
  • Max file size: 25MB per request
  • Word-level timestamps: Yes
  • Cost: Free tier available; otherwise $0.0013/minute

OpenAI (Optional):

  • Endpoint: api.openai.com/v1/audio/transcriptions
  • Model: whisper-1
  • Max file size: 25MB per request
  • Word-level timestamps: Yes
  • Cost: $0.006/minute
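
For orientation, a request to either provider could be assembled as sketched below. How transcribe.js actually builds its requests is not shown in this document, so `PROVIDERS` and `buildRequest` are hypothetical names. The form fields follow the OpenAI-compatible transcription API (`response_format=verbose_json` with `timestamp_granularities[]=word`), and the Groq URL uses that provider's `/openai/v1` path. The example only builds the request; nothing is sent.

```javascript
// Assemble (but do not send) a transcription request for either provider.
// Requires Node.js 18+ for the global FormData and Blob.
const PROVIDERS = {
  groq: {
    url: "https://api.groq.com/openai/v1/audio/transcriptions",
    model: "whisper-large-v3-turbo",
    keyVar: "GROQ_API_KEY",
  },
  openai: {
    url: "https://api.openai.com/v1/audio/transcriptions",
    model: "whisper-1",
    keyVar: "OPENAI_API_KEY",
  },
};

function buildRequest(provider, audioBytes, fileName, { language, env = process.env } = {}) {
  const { url, model, keyVar } = PROVIDERS[provider];
  const apiKey = env[keyVar];
  if (!apiKey) throw new Error(`${keyVar} is not set`);

  const body = new FormData();
  body.append("file", new Blob([audioBytes]), fileName);
  body.append("model", model);
  body.append("response_format", "verbose_json"); // required for timestamp output
  body.append("timestamp_granularities[]", "word"); // ask for word-level timings
  if (language) body.append("language", language); // omit to let the API auto-detect

  return { url, init: { method: "POST", headers: { Authorization: `Bearer ${apiKey}` }, body } };
}

const request = buildRequest("groq", Buffer.from("fake audio"), "audio.mp3", {
  language: "es",
  env: { GROQ_API_KEY: "gsk_your_api_key_here" },
});
console.log(request.url, request.init.body.get("model"));
```

The returned pair can be passed to `fetch(url, init)`; the `Content-Type` header is left unset so that `fetch` can add the multipart boundary itself.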

Timestamp Adjustment

For segmented files, timestamps are adjusted:

Segment 1: [0:00 - 10:00] → [0:00 - 10:00]
Segment 2: [0:00 - 10:00] → [10:00 - 20:00]
Segment 3: [0:00 - 10:00] → [20:00 - 30:00]

Example:

Segment 2 word: "discussion", start: 5:30
Adjusted timestamp: 5:30 + 10:00 = 15:30
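
The adjustment amounts to adding each segment's offset (segment index × segment length) to every word in that segment. A minimal sketch, assuming one array of word objects per segment in order and a hypothetical helper named `mergeSegments`:

```javascript
// Shift each segment's word timestamps by the segment's position in the original file.
function mergeSegments(segments, segmentSeconds = 600) {
  return segments.flatMap((segmentWords, index) => {
    const offset = index * segmentSeconds;
    return segmentWords.map((w) => ({ ...w, start: w.start + offset, end: w.end + offset }));
  });
}

// Segment 2 (index 1): "discussion" at 5:30 becomes 15:30 in the merged transcript.
const merged = mergeSegments([
  [{ word: "welcome", start: 1.0, end: 1.5 }],
  [{ word: "discussion", start: 330, end: 330.6 }],
]);
console.log(merged[1].start); // 930 seconds = 15:30
```

Words in the first segment are unchanged because their offset is zero.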

Examples

Transcribe Meeting Recording

#!/bin/bash
MEETING="meeting-$(date +%Y%m%d).mp3"

echo "Transcribing meeting..."
{baseDir}/transcribe.js "$MEETING" --format text --output "$MEETING.txt"
{baseDir}/transcribe.js "$MEETING" --format srt --output "$MEETING.srt"
{baseDir}/transcribe.js "$MEETING" --format json --output "$MEETING.json"

echo "Done: $MEETING.{txt,srt,json}"

Batch Transcribe Directory

#!/bin/bash
mkdir -p transcripts

for audio in *.mp3 *.m4a *.wav; do
  [ -f "$audio" ] || continue
  
  echo "Processing: $audio"
  base="${audio%.*}"
  
  # Test the command directly instead of inspecting $? afterwards
  if {baseDir}/transcribe.js "$audio" --format srt --output "transcripts/${base}.srt" 2>/dev/null; then
    echo "  ✓ Created transcripts/${base}.srt"
  else
    echo "  ✗ Failed (exit code $?)"
  fi
  
  sleep 1  # Rate limit protection
done

Create Searchable Meeting Archive

#!/bin/bash
INPUT="meeting.mp3"

# Transcribe with word timings
{baseDir}/transcribe.js "$INPUT" --format json --output meeting.json

# Extract every word with its start time (rounded to two decimals)
jq -r '
  .words[] |
  "\(.start * 100 | round / 100)\t\(.word)"
' meeting.json > meeting-by-words.txt

# Create time-indexed file
echo "Meeting transcript indexed by time" > index.txt
while IFS=$'\t' read -r time word; do
  echo "$time: $word" >> index.txt
done < meeting-by-words.txt

echo "Archive created: index.txt"

Subtitle Synchronization

#!/bin/bash
VIDEO="video.mp4"
AUDIO="video.m4a"  # Audio track extracted from $VIDEO beforehand

# Get word-level transcription
{baseDir}/transcribe.js "$AUDIO" --format json --output transcription.json

# Build SRT cues of up to 5 words each from the word timings
jq -r '
  # Left-pad a number with zeros to width n
  def pad(n): tostring | ("0" * (n - length)) + .;

  # Format seconds as HH:MM:SS,mmm
  def srt_time:
    (. * 1000 | round) as $t
    | "\($t / 3600000 | floor | pad(2)):\($t % 3600000 / 60000 | floor | pad(2)):\($t % 60000 / 1000 | floor | pad(2)),\($t % 1000 | pad(3))";

  # Emit one numbered cue per group of 5 words
  .words as $w
  | range(0; $w | length; 5) as $i
  | $w[$i:$i + 5] as $cue
  | "\($i / 5 + 1)",
    "\($cue[0].start | srt_time) --> \($cue[-1].end | srt_time)",
    ($cue | map(.word) | join(" ")),
    ""
' transcription.json > subtitles.srt

echo "SRT subtitles created: subtitles.srt"

Extract Keywords with Timestamps

#!/bin/bash
AUDIO="recording.mp3"
KEYWORDS=("budget" "timeline" "decision")

# Transcribe
{baseDir}/transcribe.js "$AUDIO" --format json --output data.json

# Find keywords with timestamps
echo "Keyword timestamps:"
for kw in "${KEYWORDS[@]}"; do
  # Lowercase inside jq so the script also runs on Bash 3.2 (macOS default)
  jq -r --arg kw "$kw" '.words[] | select(.word | ascii_downcase | contains($kw | ascii_downcase)) | "\(.word) at \(.start)s"' data.json
done

Performance Tips

1. Use Cache

# First time (slow)
{baseDir}/transcribe.js audio.mp3

# Second time (fast)
{baseDir}/transcribe.js audio.mp3

# Same file, different format - different cache
{baseDir}/transcribe.js audio.mp3 --format srt  # New cache entry

2. Specify Language

# Auto-detect (slower first pass)
{baseDir}/transcribe.js spanish.mp3

# Specify language (faster, more accurate)
{baseDir}/transcribe.js spanish.mp3 --language es

3. Pre-extract Audio

# Slower: video with embedded audio
{baseDir}/transcribe.js video.mp4

# Faster: pre-extracted audio
ffmpeg -i video.mp4 -vn -c:a libmp3lame -b:a 192k audio.mp3
{baseDir}/transcribe.js audio.mp3

4. Batch Processing

# Process multiple files
for f in *.mp3; do
  {baseDir}/transcribe.js "$f" &
done
wait

5. Parallel Segments

# Large files process segments in parallel
# 30-minute file with 3 segments
# Elapsed time: ~60 seconds (3x faster than sequential)

Notes

  • Maximum file duration: 2 hours
  • Maximum file size for direct upload: 25MB
  • Cache key includes the output format (srt and json results are cached separately)
  • API rate limits: 60 requests/minute
  • Segment size: 10 minutes (configurable in code)
  • Word timestamps provide ~50ms precision
  • SRT/VTT formats group words into phrases (~5 words)
  • TSV/CSV provide per-word timestamps
  • JSON includes all metadata and word-level data
  • Audio preprocessing preserves quality while optimizing for Whisper
  • FFmpeg required for format conversion and segmentation
  • Network errors retry up to 3 times with exponential backoff
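
The retry behavior in the last note can be pictured with the sketch below. The 1-second base delay, the doubling factor, and the names `backoffDelays` and `withRetries` are assumptions for illustration; the document states only that network errors are retried up to 3 times with exponential backoff.

```javascript
// Delay before each retry: baseMs, 2 * baseMs, 4 * baseMs, ...
function backoffDelays(retries = 3, baseMs = 1000) {
  return Array.from({ length: retries }, (_, attempt) => baseMs * 2 ** attempt);
}

// Run `task`, retrying after each delay while `isRetryable(error)` returns true.
async function withRetries(task, options = {}) {
  const {
    retries = 3,
    baseMs = 1000,
    isRetryable = () => true,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;
  const delays = backoffDelays(retries, baseMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      // Give up once the retries are used up or the error is not transient.
      if (attempt >= delays.length || !isRetryable(error)) throw error;
      await sleep(delays[attempt]);
    }
  }
}

console.log(backoffDelays()); // [ 1000, 2000, 4000 ]
```

Passing an `isRetryable` predicate keeps permanent failures, such as a missing API key, from being retried.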