Audio Transcriber (audio-transcribe)
Speech recognition using WhisperX with multi-language support and word-level timestamp alignment.
Prerequisites
Requires Python 3.12 (uv manages this automatically).
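uv can pin the interpreter because the script declares its requirements inline (PEP 723 script metadata). As a sketch, the header of transcribe.py likely looks something like the following; the exact dependency list may differ:
# Inline script metadata read by uv: illustrative sketch, not the verbatim header
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "whisperx",
# ]
# ///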
Usage
When the user wants to transcribe audio/video: $ARGUMENTS
Instructions
Step 1: Get input file
If the user has not provided an input file path, ask them to provide one.
Supported formats:
- Audio: MP3, WAV, FLAC, M4A, OGG, etc.
- Video: MP4, MKV, MOV, AVI, etc. (audio is extracted automatically)
Verify the file exists:
ls -la "$INPUT_FILE"
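If a programmatic check is preferred over ls, a minimal Python sketch is below; the extension sets mirror the lists above and are illustrative, not exhaustive:
from pathlib import Path
import sys

AUDIO_EXTS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}  # from the audio list above
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi"}           # audio is extracted from these

def check_input(path_str: str) -> Path:
    path = Path(path_str)
    if not path.is_file():
        sys.exit(f"Input file not found: {path}")
    if path.suffix.lower() not in AUDIO_EXTS | VIDEO_EXTS:
        # Not fatal: the "etc." above means other containers may still work
        print(f"Warning: unrecognized extension {path.suffix}")
    return path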
Step 2: Ask user for configuration
Warning: You MUST use AskUserQuestion to collect user preferences. Do not skip this step.
Use AskUserQuestion to collect the following:
- Model size: Choose the recognition model
  - Options:
    - "base - Balanced speed and accuracy (Recommended)"
    - "tiny - Fastest, lower accuracy"
    - "small - Faster, moderate accuracy"
    - "medium - Slower, higher accuracy"
    - "large-v2 - Slowest, highest accuracy"
- Language: What language is the audio?
  - Options:
    - "Auto-detect (Recommended)"
    - "Chinese (zh)"
    - "English (en)"
    - "Japanese (ja)"
    - "Other"
- Word-level alignment: Do you need word-level timestamps?
  - Options:
    - "Yes - Precise timing for each word (Recommended)"
    - "No - Sentence-level timing only (faster)"
- Output format: Which format should the output use?
  - Options:
    - "TXT - Plain text with timestamps (Recommended)"
    - "SRT - Subtitle format"
    - "VTT - Web subtitle format"
    - "JSON - Structured data (with word-level info)"
- Output path: Where to save?
  - Default: same directory as the input file, named <original_name>.txt (or the extension matching the chosen format)
Step 3: Run transcription script
Use the transcribe.py script in the skill directory:
uv run /path/to/skills/audio-transcribe/transcribe.py "INPUT_FILE" [OPTIONS]
Parameters:
- --model, -m: Model size (tiny/base/small/medium/large-v2)
- --language, -l: Language code (en/zh/ja/...); auto-detects if not specified
- --no-align: Skip word-level alignment
- --no-vad: Disable VAD filtering (use if the transcription has time jumps or missing segments)
- --output, -o: Output file path
- --format, -f: Output format (srt/vtt/txt/json)
Examples:
# Basic transcription (auto-detect language)
uv run skills/audio-transcribe/transcribe.py "video.mp4" -o "video.txt"
# Chinese transcription, output SRT subtitles
uv run skills/audio-transcribe/transcribe.py "audio.mp3" -l zh -f srt -o "subtitles.srt"
# Fast transcription, skip word alignment
uv run skills/audio-transcribe/transcribe.py "audio.wav" --no-align -o "transcript.txt"
# Use a larger model, output JSON (with word-level timestamps)
uv run skills/audio-transcribe/transcribe.py "speech.mp3" -m medium -f json -o "result.json"
# Disable VAD filtering (fix time jumps / missing segments)
uv run skills/audio-transcribe/transcribe.py "audio.mp3" --no-vad -o "transcript.txt"
Step 4: Present results
After transcription completes:
- Show the full output file path
- Display a preview of the transcription content
- Report total duration and segment count
Output format reference
TXT format
[00:00:00.000 - 00:00:03.500] This is the first sentence
[00:00:03.500 - 00:00:07.200] This is the second sentence
SRT format
1
00:00:00,000 --> 00:00:03,500
This is the first sentence
2
00:00:03,500 --> 00:00:07,200
This is the second sentence
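Both formats encode the same segment start/end times in seconds; only the rendering differs (TXT uses a dot before milliseconds, SRT uses a comma). A hypothetical Python helper showing the SRT conversion:
def to_srt_timestamp(seconds: float) -> str:
    """Convert seconds (e.g. 3.5) to SRT's HH:MM:SS,mmm form (00:00:03,500)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

assert to_srt_timestamp(3.5) == "00:00:03,500"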
JSON format (with word-level)
[
{
"start": 0.0,
"end": 3.5,
"text": "This is the first sentence",
"words": [
{"word": "This", "start": 0.0, "end": 0.5, "score": 0.95},
...
]
}
]
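Because the JSON format keeps word-level timing, it is the easiest output to post-process. A minimal Python sketch that loads a result and prints per-word timings, assuming the shape shown above ("result.json" is an example path):
import json

with open("result.json", encoding="utf-8") as f:
    segments = json.load(f)

for seg in segments:
    print(f"[{seg['start']:.3f} - {seg['end']:.3f}] {seg['text']}")
    for w in seg.get("words", []):  # "words" is absent when --no-align was used
        print(f"  {w['word']}: {w['start']:.3f}-{w['end']:.3f} (score {w.get('score', 0.0):.2f})")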
Troubleshooting
Slow on first run:
- WhisperX downloads model files on first use
- Subsequent runs reuse the cached model
Out of memory:
- Use a smaller model (tiny or base)
- Ensure the system has enough memory
Low recognition accuracy:
- Try a larger model (medium or large-v2)
- Explicitly specify the language instead of auto-detect