skills/kouko/monkey-knowledge-youtube-skills/mk-youtube-audio-transcribe

mk-youtube-audio-transcribe

SKILL.md

YouTube Audio Transcribe

Transcribe audio files to text using local whisper.cpp (no cloud API required).

Quick Start

/mk-youtube-audio-transcribe <audio_file> [model] [language] [--force]

Parameters

Parameter Required Default Description
audio_file Yes - Path to audio file
model No auto Model: auto, tiny, base, small, medium, large-v3, belle-zh, kotoba-ja
language No auto Language code: en, ja, zh, auto (auto-detect)
--force No false Force re-transcribe even if cached file exists

Examples

  • /mk-youtube-audio-transcribe /path/to/audio/video.m4a - Transcribe with auto model selection
  • /mk-youtube-audio-transcribe video.m4a auto zh - Auto-select best model for Chinese → belle-zh
  • /mk-youtube-audio-transcribe video.m4a auto ja - Auto-select best model for Japanese → kotoba-ja
  • /mk-youtube-audio-transcribe audio.mp3 small en - Use small model, force English
  • /mk-youtube-audio-transcribe podcast.wav medium ja - Use medium model (explicit), Japanese

How it Works

  1. Execute: {baseDir}/scripts/transcribe.sh "<audio_file>" "<model>" "<language>"
  2. Auto-download model if not found (with progress)
  3. Convert audio to 16kHz mono WAV using ffmpeg
  4. Run whisper-cli for transcription
  5. Save full JSON to {baseDir}/data/<filename>.json
  6. Save plain text to {baseDir}/data/<filename>.txt
  7. Return file paths and metadata
┌─────────────────────────────┐
│      transcribe.sh          │
│  audio_file, [model], [lang]│
└──────────────┬──────────────┘
┌─────────────────────────────┐
│   ffmpeg: convert to WAV    │
│   16kHz, mono, pcm_s16le    │
└──────────────┬──────────────┘
┌─────────────────────────────┐
│   whisper-cli: transcribe   │
│   with Metal acceleration   │
└──────────────┬──────────────┘
┌─────────────────────────────┐
│   Save to files             │
│   .json (full) + .txt       │
└──────────────┬──────────────┘
┌─────────────────────────────┐
│   Return file paths         │
│   {file_path, text_file_path}│
└─────────────────────────────┘

Output Format

Success:

{
  "status": "success",
  "file_path": "{baseDir}/data/20091025__VIDEO_ID.json",
  "text_file_path": "{baseDir}/data/20091025__VIDEO_ID.txt",
  "language": "en",
  "duration": "3:32",
  "model": "medium",
  "char_count": 12345,
  "line_count": 100,
  "text_char_count": 10000,
  "text_line_count": 50,
  "cached": false,
  "video_id": "dQw4w9WgXcQ",
  "title": "Video Title",
  "channel": "Channel Name",
  "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}

Cache hit (returns existing transcription):

{
  "status": "success",
  "file_path": "{baseDir}/data/20091025__VIDEO_ID.json",
  "cached": true,
  ...
}

Error (general):

{
  "status": "error",
  "message": "Error description"
}

Error (unknown model):

{
  "status": "error",
  "error_code": "UNKNOWN_MODEL",
  "message": "Unknown model: invalid-name",
  "available_models": ["tiny", "base", "small", "medium", "large-v3", "large-v3-turbo", "belle-zh", "kotoba-ja", "kotoba-ja-q5"]
}

When you receive UNKNOWN_MODEL error: suggest a valid model from the available_models list.

Error (model not found):

{
  "status": "error",
  "error_code": "MODEL_NOT_FOUND",
  "message": "Model 'medium' not found. Please download it first.",
  "model": "medium",
  "model_size": "1.4GB",
  "download_url": "https://huggingface.co/...",
  "download_command": "curl -L --progress-bar -o '/path/to/models/ggml-medium.bin' 'https://...' 2>&1"
}

When you receive MODEL_NOT_FOUND error:

  1. Inform user: "Downloading model '{model}' ({model_size})..."
  2. Execute download_command using Bash tool with timeout: 1800000 (30 minutes)
  3. After download completes: re-run the original transcribe command

Error (model corrupted):

{
  "status": "error",
  "error_code": "MODEL_CORRUPTED",
  "message": "Model 'medium' is corrupted or incomplete. Please re-download.",
  "model": "medium",
  "model_size": "1.4GB",
  "expected_sha256": "6c14d5adee5f86394037b4e4e8b59f1673b6cee10e3cf0b11bbdbee79c156208",
  "actual_sha256": "def456...",
  "model_path": "/path/to/models/ggml-medium.bin",
  "download_command": "rm '/path/to/models/ggml-medium.bin' && curl -L --progress-bar -o '/path/to/models/ggml-medium.bin' 'https://...' 2>&1"
}

When you receive MODEL_CORRUPTED error:

  1. Inform user: "Model '{model}' is corrupted. Re-downloading ({model_size})..."
  2. Execute download_command (removes corrupted file and re-downloads) using Bash tool with timeout: 1800000 (30 minutes)
  3. After download completes: re-run the original transcribe command

Output Fields

Field Description
file_path Absolute path to JSON file (with segments)
text_file_path Absolute path to plain text file
language Detected language code
duration Audio duration
model Model used for transcription
char_count Character count of JSON file
line_count Line count of JSON file
text_char_count Character count of plain text file
text_line_count Line count of plain text file
video_id YouTube video ID (from centralized metadata store)
title Video title (from centralized metadata store)
channel Channel name (from centralized metadata store)
url Full video URL (from centralized metadata store)

Filename Format

Output files preserve the input audio filename's unified naming format with date prefix: {YYYYMMDD}__{video_id}.{ext}

Example: 20091025__dQw4w9WgXcQ.json

JSON File Format

The JSON file at file_path contains:

{
  "text": "Full transcription text...",
  "language": "en",
  "duration": "3:32",
  "model": "medium",
  "segments": [
    {
      "start": "00:00:00.000",
      "end": "00:00:05.000",
      "text": "First segment..."
    }
  ]
}

Models

Standard Models

Model Size RAM Speed Accuracy
auto - - - Auto-select based on language (default)
tiny 74MB ~273MB Fastest Low
base 141MB ~388MB Fast Medium
small 465MB ~852MB Moderate Good
medium 1.4GB ~2.1GB Slow High
large-v3 2.9GB ~3.9GB Slowest Best
large-v3-turbo 1.5GB ~2.1GB Moderate High (optimized for speed)

Language-Specialized Models

Model Language Size Description
belle-zh Chinese 1.5GB BELLE-2 Chinese-specialized model
kotoba-ja Japanese 1.4GB kotoba-tech Japanese-specialized model
kotoba-ja-q5 Japanese 513MB Quantized version (faster, smaller)

Auto-Selection (model=auto)

When model is auto (default), the system automatically selects the best model based on language:

Language Auto-Selected Model
zh belle-zh (Chinese-specialized)
ja kotoba-ja (Japanese-specialized)
others medium (general purpose)

Example: /mk-youtube-audio-transcribe video.m4a auto zh → uses belle-zh

Notes

  • File caching: If transcription already exists for this video, it will be reused (returns cached: true)
  • Force refresh: Use --force flag to re-transcribe even if cached file exists
  • Specify language for best results - enables auto-selection of specialized models (zh→belle-zh, ja→kotoba-ja)
  • Use Read tool to get file content from file_path or text_file_path
  • Models must be downloaded before first use - returns MODEL_NOT_FOUND error with download command
  • Uses Metal acceleration on macOS for faster processing
  • Supports auto language detection
  • Audio is converted to 16kHz WAV for optimal results
  • Requires ffmpeg and whisper-cli (pre-built in bin/)

Model Download

Models must be downloaded before transcription. When you receive a MODEL_NOT_FOUND error, execute the download_command with timeout: 1800000.

# In terminal (to see progress bar)
./scripts/download-model.sh medium      # 1.4GB
./scripts/download-model.sh belle-zh    # 1.5GB (Chinese)
./scripts/download-model.sh kotoba-ja   # 1.4GB (Japanese)
./scripts/download-model.sh --list      # Show all available models

Next Step

After transcription completes, invoke /mk-youtube-transcript-summarize with the text_file_path from the output to generate a structured summary:

/mk-youtube-transcript-summarize <text_file_path>

IMPORTANT: Always use the Skill tool to invoke /mk-youtube-transcript-summarize. Do NOT generate summaries directly without loading the skill — it contains critical rules for compression ratio, section structure, data preservation, and language handling.

Weekly Installs
3
First Seen
Feb 25, 2026
Installed on
opencode3
gemini-cli3
codebuddy3
github-copilot3
codex3
kimi-cli3