Audio to Text Skill

Transcribe audio files to text with automatic language detection.

Features

  • Apple Silicon optimized using mlx-whisper
  • Automatic language detection (Chinese/English)
  • Chunked processing for long audio files (up to 5 hours)
  • Resumes an interrupted transcription instead of starting over
  • Progress tracking
  • Multiple output formats (txt, srt, json)
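
To make the chunking and resume features more concrete, here is a minimal sketch of how a long recording could be split into fixed-length windows and how finished chunks could be tracked on disk. It is not taken from transcribe.py: the function names (plan_chunks, pending_chunks, mark_done) and the state file layout are hypothetical and serve only to illustrate the idea.

```python
import json
from pathlib import Path


def plan_chunks(duration_s, chunk_minutes=15):
    """Split [0, duration_s) into consecutive (start, end) windows in seconds."""
    step = chunk_minutes * 60
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + step, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


def load_done(state_path):
    """Read the set of finished chunk indices (hypothetical state file layout)."""
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text())["done"])


def pending_chunks(chunks, state_path):
    """Return (index, start, end) for every chunk not yet marked as done."""
    done = load_done(state_path)
    return [(i, s, e) for i, (s, e) in enumerate(chunks) if i not in done]


def mark_done(index, state_path):
    """Record a finished chunk so a later run can skip it."""
    done = load_done(state_path)
    done.add(index)
    state_path.write_text(json.dumps({"done": sorted(done)}))
```

With the default of 15 minutes per chunk, a 5-hour file yields 20 chunks. After an interruption, only the chunks missing from the state file would need to be transcribed again.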

Usage

Basic Transcription

# Transcribe to default output (same directory as input)
python {baseDir}/scripts/transcribe.py "<audio_file>"

# Transcribe with custom output path
python {baseDir}/scripts/transcribe.py "<audio_file>" -o output.txt

# Use a specific model (default: small)
python {baseDir}/scripts/transcribe.py "<audio_file>" --model medium

Options

  • --model: Model size (tiny, base, small, medium, large-v3). Default: small
  • --chunk-minutes: Minutes per chunk for long audio. Default: 15
  • --format: Output format (txt, srt, json). Default: txt
  • --language: Force a specific language. Default: auto-detect
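
The options above map naturally onto an argparse parser. The sketch below is an assumption about how such a command line could be declared, not the actual parser in transcribe.py; in particular the long form --output, the integer type for --chunk-minutes, and the function name build_parser are illustrative.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Transcribe audio to text")
    parser.add_argument("audio_file", help="Path to the input audio file")
    # Only -o is documented; the long name --output is an assumption.
    parser.add_argument("-o", "--output", default=None,
                        help="Output path (default: next to the input file)")
    parser.add_argument("--model", default="small",
                        choices=["tiny", "base", "small", "medium", "large-v3"])
    parser.add_argument("--chunk-minutes", type=int, default=15,
                        help="Minutes per chunk for long audio")
    parser.add_argument("--format", default="txt",
                        choices=["txt", "srt", "json"])
    parser.add_argument("--language", default=None,
                        help="Language code; auto-detect when omitted")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Restricting --model and --format with choices means a typo such as --format str fails immediately with a usage message rather than partway through a long transcription.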

Examples

python {baseDir}/scripts/transcribe.py podcast.mp3
python {baseDir}/scripts/transcribe.py interview.wav -o transcript.txt --model medium
python {baseDir}/scripts/transcribe.py lecture.mp3 --format srt --chunk-minutes 10

Output Format

TXT Format

Plain text with paragraphs.

SRT Format

SubRip subtitle format with timestamps.
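
As an illustration of what the SRT output involves, the sketch below converts segments with start and end times in seconds into numbered SubRip cues using the HH:MM:SS,mmm timestamp form. The segment dictionaries follow the JSON format described in this document; the function names are hypothetical and the script's own formatting may differ in detail.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, e.g. 00:00:05,200."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for number, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{number}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)
```

Working in whole milliseconds avoids floating-point drift when splitting a time into hours, minutes, and seconds.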

JSON Format

{
  "language": "zh",
  "segments": [
    {"start": 0.0, "end": 5.2, "text": "..."}
  ],
  "text": "..."
}
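
A JSON transcript with this shape is straightforward to post-process. The following sketch, which assumes only the three keys shown above, reads a transcript and reports the detected language, the number of segments, and the total duration covered by segments. The function name summarize_transcript is illustrative.

```python
import json


def summarize_transcript(raw_json):
    """Return language, segment count, and seconds covered by segments."""
    data = json.loads(raw_json)
    segments = data.get("segments", [])
    covered = sum(seg["end"] - seg["start"] for seg in segments)
    return {
        "language": data.get("language"),
        "segment_count": len(segments),
        "covered_seconds": covered,
    }


if __name__ == "__main__":
    example = '{"language": "zh", "segments": [{"start": 0.0, "end": 5.2, "text": "..."}], "text": "..."}'
    print(summarize_transcript(example))
```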

Troubleshooting

Out of Memory

Use a smaller model (for example --model base) or reduce the chunk length with --chunk-minutes so that less audio is held in memory at once.

First Run Slow

The first run downloads the model from Hugging Face (150 MB to 3 GB, depending on model size), so it takes noticeably longer than later runs.

Performance

mlx-whisper is optimized for Apple Silicon and is reported to run roughly 30% faster than other Whisper implementations on M-series chips; the actual speedup depends on the model size and hardware.
