Parakeet-MLX Audio Transcription
This skill provides instructions for using parakeet-mlx to convert audio files to text using NVIDIA's Parakeet automatic speech recognition model optimized for Apple's MLX framework.
Overview
Parakeet-MLX brings NVIDIA's Parakeet ASR model to Apple Silicon, enabling fast, high-quality on-device speech-to-text conversion similar to Whisper but optimized for Apple's MLX framework.
Key Features:
- High-quality transcription with timestamps
- Fast processing (e.g., 1+ hour audio in ~53 seconds)
- On-device processing (no cloud API required)
- Output in SRT subtitle format
- No installation required - run directly with uvx
Basic Usage
Transcribe an Audio File
The simplest way to transcribe an audio file is:
uvx parakeet-mlx /path/to/audio/file.mp3
This will:
- Download the model on first run (~2.5GB)
- Process the audio file
- Generate an SRT subtitle file with timestamped transcription
Example:
# Transcribe a podcast episode
uvx parakeet-mlx podcast_episode.mp3
# Transcribe an interview
uvx parakeet-mlx interview.wav
# Transcribe a meeting recording
uvx parakeet-mlx meeting_recording.m4a
Important Notes
First Run
The first time you run parakeet-mlx, it will download a ~2.5GB model file. This may take several minutes depending on your internet connection. Subsequent runs will use the cached model and start immediately.
Performance
Parakeet-MLX is optimized for Apple Silicon and provides excellent performance:
- A 65MB, 1 hour 1 minute 28 second podcast was transcribed in 53 seconds
- Performance scales with audio duration, not file size
- Processing happens entirely on-device
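Working from the example above, the speedup is roughly 70x real time. A quick check of the arithmetic:

```python
# Real-time factor from the reported example:
# a 1 h 1 min 28 s podcast transcribed in 53 seconds.
audio_seconds = 1 * 3600 + 1 * 60 + 28   # 3688 seconds of audio
processing_seconds = 53

speedup = audio_seconds / processing_seconds
print(f"~{speedup:.0f}x real time")
```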
Output Format
The tool generates an SRT (SubRip Subtitle) file with the same name as your input file:
Input: podcast.mp3
Output: podcast.srt
The SRT format includes:
- Sequential subtitle numbers
- Timestamp ranges (start --> end)
- Transcribed text for each segment
Example SRT output:
1
00:00:00,000 --> 00:00:03,500
Welcome to the podcast. Today we're discussing...
2
00:00:03,500 --> 00:00:08,200
The latest developments in machine learning...
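If you want to post-process the output programmatically, the SRT structure above is easy to parse. The following is an illustrative sketch (the `parse_srt` helper is hypothetical, not part of parakeet-mlx) that turns SRT text into (start, end, text) tuples with times in seconds:

```python
import re

def parse_srt(srt_text):
    """Parse SRT text into (start_seconds, end_seconds, text) tuples.
    Illustrative helper, not part of parakeet-mlx itself."""
    def to_seconds(ts):
        # SRT timestamps look like HH:MM:SS,mmm
        h, m, rest = ts.split(":")
        s, ms = rest.split(",")
        return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

    entries = []
    # Blocks are separated by blank lines: number, timestamp line, text lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        start, _, end = lines[1].partition(" --> ")
        entries.append((to_seconds(start), to_seconds(end), " ".join(lines[2:])))
    return entries

sample = """1
00:00:00,000 --> 00:00:03,500
Welcome to the podcast. Today we're discussing...

2
00:00:03,500 --> 00:00:08,200
The latest developments in machine learning...
"""
print(parse_srt(sample))
```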
Supported Audio Formats
While the documentation explicitly mentions MP3, parakeet-mlx likely supports common audio formats including:
- MP3
- WAV
- M4A
- FLAC
- OGG
If you encounter format issues, consider converting your audio file to MP3 or WAV first using tools like ffmpeg.
Working with Output
View the Transcription
# View the entire transcription
cat output.srt
# View just the text (remove timestamps)
grep -v "^[0-9]*$" output.srt | grep -v "^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]"
Convert SRT to Plain Text
If you need plain text without timestamps:
# Using the helper script (recommended)
python3 srt_to_text.py output.srt > transcript.txt
# Or with awk
awk 'NF && !/^[0-9]+$/ && !/^[0-9]{2}:[0-9]{2}:[0-9]{2}/' output.srt > transcript.txt
Helper Scripts
srt_to_text.py
This skill includes a Python helper script for converting SRT files to plain text. Located at skills/parakeet/srt_to_text.py.
Usage:
# Basic conversion (outputs to stdout)
python3 srt_to_text.py transcript.srt
# Save to file
python3 srt_to_text.py transcript.srt --output transcript.txt
# Preserve paragraph breaks between subtitle blocks
python3 srt_to_text.py transcript.srt --paragraphs
# Pipe to other tools
python3 srt_to_text.py transcript.srt | llm "Summarize this"
Features:
- Strips subtitle numbers and timestamps
- Joins text into continuous prose (or preserves paragraphs with --paragraphs)
- Supports stdin input for piping
- Clean output suitable for LLM processing
See also: The llm skill includes an audio-to-article.yaml template designed to clean up transcripts from this helper into polished articles.
Use with Other Tools
The SRT format is widely supported and can be:
- Imported into video editing software
- Used with subtitle players
- Converted to other formats (VTT, ASS, etc.)
- Processed with text analysis tools
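The SRT-to-VTT conversion mentioned above is mostly mechanical: WebVTT adds a `WEBVTT` header and uses a dot instead of a comma before the milliseconds. A minimal sketch (the `srt_to_vtt` function name is illustrative):

```python
import re

def srt_to_vtt(srt_text):
    """Convert SRT text to WebVTT: prepend the WEBVTT header and
    switch the millisecond separator from comma to dot."""
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + body

vtt = srt_to_vtt("1\n00:00:00,000 --> 00:00:03,500\nWelcome to the podcast.\n")
print(vtt)
```

For production use, a dedicated tool such as ffmpeg (`ffmpeg -i input.srt output.vtt`) handles edge cases like styling tags.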
Common Workflows
Transcribe Multiple Files
# Process all MP3 files in a directory
for file in *.mp3; do
echo "Processing $file..."
uvx parakeet-mlx "$file"
done
Transcribe and Extract Text
# Transcribe and immediately extract plain text
uvx parakeet-mlx audio.mp3
awk 'NF && !/^[0-9]+$/ && !/^[0-9]{2}:[0-9]{2}:[0-9]{2}/' audio.srt > audio.txt
Check Transcription Quality
After transcription, review the SRT file to verify:
- Proper segmentation of speech
- Accurate timestamps
- Text quality and accuracy
Troubleshooting
Model Download Issues
If the model download fails or is interrupted:
- Check your internet connection
- Ensure you have ~2.5GB of free disk space
- Try running the command again (it may resume download)
Performance Issues
If transcription is slow:
- Ensure you're running on Apple Silicon (M1/M2/M3)
- Close other resource-intensive applications
- For very long files, consider splitting them into smaller segments
Audio Format Not Supported
If you get an error about unsupported format:
# Convert to MP3 using ffmpeg
ffmpeg -i input.m4a -acodec libmp3lame -ab 192k output.mp3
uvx parakeet-mlx output.mp3
Use Cases
Parakeet-MLX is ideal for:
- Podcast transcription - Generate searchable text from episodes
- Interview documentation - Convert recorded interviews to text
- Meeting notes - Transcribe meeting recordings for reference
- Content creation - Generate captions or show notes
- Accessibility - Create subtitles for audio/video content
- Research - Analyze spoken content at scale
Advantages Over Other Tools
vs. Whisper:
- Optimized specifically for Apple MLX framework
- Potentially faster on Apple Silicon
- Similar quality output
vs. Cloud APIs:
- No API costs
- Complete privacy (on-device processing)
- No internet required after model download
- No file size limits
vs. Manual Transcription:
- Dramatically faster (hours to minutes)
- Consistent quality
- Includes precise timestamps
Tips for Best Results
- Audio Quality Matters: Clear audio with minimal background noise produces better results
- Speaker Clarity: Single speaker or well-separated multi-speaker audio works best
- Review Output: Always review the transcription for accuracy, especially for technical terms or names
- Use Timestamps: The SRT format's timestamps are valuable for referencing specific moments
- Batch Processing: Process multiple files in sequence for efficiency
References
- Simon Willison's parakeet-mlx article
- Run uvx parakeet-mlx --help for command-line options (if available)
Notes
- Parakeet-MLX runs entirely on-device, ensuring privacy
- The tool is designed for Apple Silicon; performance on Intel Macs may vary
- First run requires internet for model download
- Output quality is reported as "very high" for clear audio
- The tool is optimized for speech recognition, not music or other audio types