Video Clipping Expert Knowledge

Cross-Platform Notes

All tools (ffmpeg, ffprobe, yt-dlp, whisper) use identical CLI flags on Windows, macOS, and Linux. The differences are only in shell syntax:

Feature	macOS / Linux	Windows (cmd.exe)
Suppress stderr	`2>/dev/null`	`2>NUL`
Filter output	`| grep pattern`	`| findstr pattern`
Delete files	`rm file1 file2`	`del file1 file2`
Null output device	`-f null -`	`-f null -` (same)
ffmpeg subtitle paths	`subtitles=clip.srt`	`subtitles=clip.srt` (relative OK, absolute needs `C\\:/path`)

IMPORTANT: ffmpeg filter paths (-vf "subtitles=...") always need forward slashes. On Windows with absolute paths, escape the colon: subtitles=C\\:/Users/me/clip.srt
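When the filter string is built programmatically, the escaping rule above could be sketched as a small helper. This is only a sketch: the name `ffmpeg_filter_path` is hypothetical, it assumes the result is passed to ffmpeg as a literal argument (no shell re-interpreting the backslashes), and it covers only the two rules stated here, not other special characters such as quotes or commas.

```python
def ffmpeg_filter_path(path: str) -> str:
    """Return a path usable inside -vf "subtitles=..." (hypothetical helper)."""
    # ffmpeg filter arguments want forward slashes, even on Windows.
    normalized = path.replace("\\", "/")
    # Escape each colon with a double backslash, matching C\\:/Users/me/clip.srt.
    return normalized.replace(":", "\\\\:")


print(ffmpeg_filter_path(r"C:\Users\me\clip.srt"))
print(ffmpeg_filter_path("clip.srt"))
```

Relative paths pass through unchanged, which matches the "relative OK" note in the table.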

Prefer the file_write tool over shell echo/heredoc for creating SRT and text files.

yt-dlp Reference

Download with Format Selection

# Best video up to 1080p + best audio, merged
yt-dlp -f "bv[height<=1080]+ba/b[height<=1080]" --restrict-filenames -o "source.%(ext)s" "URL"

# 720p max (smaller, faster)
yt-dlp -f "bv[height<=720]+ba/b[height<=720]" --restrict-filenames -o "source.%(ext)s" "URL"

# Audio only (for transcription-only workflows)
yt-dlp -x --audio-format wav --restrict-filenames -o "audio.%(ext)s" "URL"

Metadata Inspection

# Get full metadata as JSON (duration, title, chapters, available subs)
yt-dlp --dump-json "URL"

# Key fields: duration, title, description, chapters, subtitles, automatic_captions
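To show how the key fields might be consumed, here is a minimal Python sketch that reduces the JSON from `yt-dlp --dump-json` to what clip planning needs. The helper name `summarize_metadata` and the sample dictionary are illustrative assumptions, not real output; the chapter keys `start_time`, `end_time`, and `title` are the ones yt-dlp normally emits.

```python
import json


def summarize_metadata(info: dict) -> dict:
    """Pick out the fields useful for planning clips (hypothetical helper)."""
    chapters = info.get("chapters") or []  # yt-dlp may emit null when there are none
    return {
        "title": info.get("title"),
        "duration": info.get("duration"),
        "chapters": [
            (c.get("start_time"), c.get("end_time"), c.get("title")) for c in chapters
        ],
        "manual_sub_langs": sorted((info.get("subtitles") or {}).keys()),
        "auto_sub_langs": sorted((info.get("automatic_captions") or {}).keys()),
    }


# Illustrative sample only; in practice parse the stdout of yt-dlp --dump-json.
sample = json.loads(
    '{"title": "Demo", "duration": 600, "chapters": null,'
    ' "subtitles": {}, "automatic_captions": {"en": []}}'
)
summary = summarize_metadata(sample)
print(summary)
```

If `manual_sub_langs` is empty but `auto_sub_langs` contains the target language, the auto-subtitle download is the one to try.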

YouTube Auto-Subtitles

# Download auto-generated subtitles in json3 format (word-level timing)
yt-dlp --write-auto-subs --sub-lang en --sub-format json3 --skip-download --restrict-filenames -o "source" "URL"

# Download manual subtitles if available
yt-dlp --write-subs --sub-lang en --sub-format srt --skip-download --restrict-filenames -o "source" "URL"

# List available subtitle languages
yt-dlp --list-subs "URL"

Useful Flags

--restrict-filenames — safe ASCII filenames (no spaces/special chars) — important on all platforms
--no-playlist — download single video even if URL is in a playlist
-o "template.%(ext)s" — output template (%(ext)s auto-detects format)
--cookies-from-browser chrome — use browser cookies for age-restricted content
--extract-audio / -x — extract audio only
--audio-format wav — convert audio to wav (for whisper)

Whisper Transcription Reference

Audio Extraction for Whisper

# Extract mono 16kHz WAV (whisper's preferred input format)
ffmpeg -i source.mp4 -vn -ar 16000 -ac 1 -y audio.wav

Basic Transcription

# Standard transcription with word-level timestamps
whisper audio.wav --model small --output_format json --word_timestamps True --language en

# Faster alternative (same flags, up to 4x faster)
whisper-ctranslate2 audio.wav --model small --output_format json --word_timestamps True --language en

Model Sizes

Model	VRAM	Speed	Quality	Use When
tiny	~1GB	Fastest	Rough	Quick previews, testing pipeline
base	~1GB	Fast	OK	Short clips, clear speech
small	~2GB	Good	Good	Default — best balance
medium	~5GB	Slow	Better	Important content, accented speech
large-v3	~10GB	Slowest	Best	Final production, multiple languages

Note: On macOS Apple Silicon, consider mlx-whisper as a faster native alternative.

JSON Output Structure

{
  "text": "full transcript text...",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 4.52,
      "text": " Hello everyone, welcome back.",
      "words": [
        {"word": " Hello", "start": 0.0, "end": 0.32, "probability": 0.95},
        {"word": " everyone,", "start": 0.32, "end": 0.78, "probability": 0.91},
        {"word": " welcome", "start": 0.78, "end": 1.14, "probability": 0.98},
        {"word": " back.", "start": 1.14, "end": 1.52, "probability": 0.97}
      ]
    }
  ]
}

segments[].words[] gives word-level timing when --word_timestamps True is set
probability is the model's confidence for each word (below 0.5 usually means the word is wrong)
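As an illustration of the confidence rule, the following Python sketch lists words whose probability falls below a threshold so they can be reviewed before captions are built. The helper name `low_confidence_words` and the sample transcript are assumptions made for the example.

```python
def low_confidence_words(transcript: dict, threshold: float = 0.5) -> list:
    """Return (start, word, probability) for words below the threshold."""
    flagged = []
    for segment in transcript.get("segments", []):
        # "words" is only present when word timestamps were requested.
        for word in segment.get("words", []):
            if word.get("probability", 1.0) < threshold:
                flagged.append((word["start"], word["word"].strip(), word["probability"]))
    return flagged


# Illustrative transcript in the shape shown above (not real output).
sample = {
    "segments": [
        {
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.32, "probability": 0.95},
                {"word": " evryone,", "start": 0.32, "end": 0.78, "probability": 0.42},
            ]
        }
    ]
}
flagged = low_confidence_words(sample)
print(flagged)
```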

YouTube json3 Subtitle Parsing

Format Structure

{
  "events": [
    {
      "tStartMs": 1230,
      "dDurationMs": 5000,
      "segs": [
        {"utf8": "hello ", "tOffsetMs": 0},
        {"utf8": "world ", "tOffsetMs": 200},
        {"utf8": "how ", "tOffsetMs": 450},
        {"utf8": "are you", "tOffsetMs": 700}
      ]
    }
  ]
}

Extracting Word Timing

For each event and each segment within it:

word_start_ms = event.tStartMs + seg.tOffsetMs
word_start_secs = word_start_ms / 1000.0
word_text = seg.utf8.trim()

Events without segs are line breaks or formatting — skip them. Events with segs containing only "\n" are newlines — skip them.
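The extraction and skip rules above could be written in Python roughly as follows. This is a sketch: the function name `json3_words` is hypothetical, and treating a missing `tOffsetMs` as 0 is an assumption on my part (the first segment of an event often omits it).

```python
def json3_words(data: dict) -> list:
    """Flatten json3 events into (start_seconds, text) pairs."""
    words = []
    for event in data.get("events", []):
        segs = event.get("segs")
        if not segs:
            continue  # formatting or line-break event
        for seg in segs:
            text = seg.get("utf8", "").strip()
            if not text:
                continue  # segment holding only "\n" or whitespace
            # Assumption: a segment without tOffsetMs starts with the event.
            start_ms = event.get("tStartMs", 0) + seg.get("tOffsetMs", 0)
            words.append((start_ms / 1000.0, text))
    return words


sample = {
    "events": [
        {"tStartMs": 0, "dDurationMs": 100},  # no segs: skipped
        {"tStartMs": 1000, "segs": [{"utf8": "\n"}]},  # newline only: skipped
        {
            "tStartMs": 1230,
            "dDurationMs": 5000,
            "segs": [
                {"utf8": "hello ", "tOffsetMs": 0},
                {"utf8": "world ", "tOffsetMs": 200},
                {"utf8": "how ", "tOffsetMs": 450},
                {"utf8": "are you", "tOffsetMs": 700},
            ],
        },
    ]
}
words = json3_words(sample)
print(words)
```

Note that a segment can hold more than one word ("are you" here), so end times have to be inferred from the next segment's start or from the event duration.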

SRT Generation from Transcript

SRT Format

1
00:00:00,000 --> 00:00:02,500
First line of caption text

2
00:00:02,500 --> 00:00:05,100
Second line of caption text

Rules for Building Good SRT

Group words into subtitle lines of ~8-12 words (2-3 seconds per line)
Break at natural pause points (periods, commas, clause boundaries)
Keep lines under 42 characters for readability on mobile
Adjust timestamps relative to clip start (subtract clip start time from all timestamps)
Timestamp format: HH:MM:SS,mmm (comma separator, not dot)
Each entry: index line, timestamp line, text line(s), blank line
Use file_write tool to create the SRT file — works identically on all platforms
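A minimal Python sketch of these rules follows, assuming word dictionaries in the Whisper shape (`word`, `start`, `end`). The names `srt_timestamp` and `build_srt` are hypothetical, and the grouping is deliberately simple: it breaks only at sentence-ending punctuation or a word cap, and does not enforce the 42-character guideline.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (comma before milliseconds)."""
    ms = max(0, int(round(seconds * 1000)))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(words: list, clip_start: float = 0.0, max_words: int = 10) -> str:
    """Group words into entries and shift timestamps to be clip-relative."""
    groups, current = [], []
    for w in words:
        current.append(w)
        # Close the entry at a sentence end or when the word cap is reached.
        if len(current) >= max_words or w["word"].strip().endswith((".", "?", "!")):
            groups.append(current)
            current = []
    if current:
        groups.append(current)

    entries = []
    for index, group in enumerate(groups, start=1):
        start = srt_timestamp(group[0]["start"] - clip_start)
        end = srt_timestamp(group[-1]["end"] - clip_start)
        text = " ".join(w["word"].strip() for w in group)
        entries.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(entries)  # blank line between entries


words = [
    {"word": " Hello", "start": 90.0, "end": 90.32},
    {"word": " everyone,", "start": 90.32, "end": 90.78},
    {"word": " welcome", "start": 90.78, "end": 91.14},
    {"word": " back.", "start": 91.14, "end": 91.52},
]
srt_text = build_srt(words, clip_start=90.0)
print(srt_text)
```

The resulting string can then be written with the file_write tool as described above.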

Styled Captions with ASS Format

For animated/styled captions, use ASS subtitle format instead of SRT:

ffmpeg -i clip.mp4 -vf "subtitles=clip.ass:force_style='FontSize=22,FontName=Arial,Bold=1,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,Outline=2,Shadow=1,Alignment=2,MarginV=40'" -c:a copy output.mp4

Key ASS style properties:

PrimaryColour=&H00FFFFFF — white text (AABBGGRR format)
OutlineColour=&H00000000 — black outline
Outline=2 — outline thickness
Alignment=2 — bottom center
MarginV=40 — margin from bottom edge
FontSize=22 — good size for 1080x1920 vertical
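Because the AABBGGRR byte order is easy to get backwards, a small conversion helper may help. The sketch below is in Python, and the name `ass_colour` is hypothetical; alpha 0 means fully opaque, as in the values above.

```python
def ass_colour(red: int, green: int, blue: int, alpha: int = 0) -> str:
    """Convert RGB (0-255 each) to an ASS &HAABBGGRR colour string."""
    for value in (red, green, blue, alpha):
        if not 0 <= value <= 255:
            raise ValueError("colour components must be in 0..255")
    # ASS stores the channels in reverse order: alpha, blue, green, red.
    return f"&H{alpha:02X}{blue:02X}{green:02X}{red:02X}"


print(ass_colour(255, 255, 255))  # white
print(ass_colour(255, 255, 0))    # yellow text
```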

FFmpeg Video Processing

Scene Detection

ffmpeg -i input.mp4 -filter:v "select='gt(scene,0.3)',showinfo" -f null - 2>&1

Threshold 0.1 = very sensitive, 0.5 = only major cuts
Parse pts_time: from showinfo output for timestamps
On macOS/Linux pipe through grep showinfo, on Windows pipe through findstr showinfo
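Instead of piping through grep or findstr, the captured output can be parsed directly. The Python sketch below pulls `pts_time` values out of showinfo lines; the function name `scene_times` is hypothetical and the sample log is abbreviated and illustrative, not real ffmpeg output.

```python
import re

PTS_TIME = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def scene_times(log: str) -> list:
    """Return scene-change timestamps (seconds) found in showinfo lines."""
    times = []
    for line in log.splitlines():
        if "showinfo" not in line:
            continue  # ignore unrelated ffmpeg log lines
        match = PTS_TIME.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


# Abbreviated, illustrative log lines.
sample_log = """\
[Parsed_showinfo_1 @ 0x1] config in time_base: 1/90000
[Parsed_showinfo_1 @ 0x1] n:   0 pts: 370370 pts_time:4.11522 fmt:yuv420p
frame=    2 fps=0.0 q=-0.0 size=N/A
[Parsed_showinfo_1 @ 0x1] n:   1 pts:1125000 pts_time:12.5 fmt:yuv420p
"""
cuts = scene_times(sample_log)
print(cuts)
```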

Silence Detection

ffmpeg -i input.mp4 -af "silencedetect=noise=-30dB:d=1.5" -f null - 2>&1

d=1.5 = minimum 1.5 seconds of silence
Look for silence_start and silence_end in output
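The silence markers can be paired into ranges with a short parser. This Python sketch assumes each `silence_start` is followed by its `silence_end`; the name `silence_ranges` is hypothetical and the sample log is illustrative.

```python
import re

SILENCE_START = re.compile(r"silence_start:\s*(-?[0-9.]+)")
SILENCE_END = re.compile(r"silence_end:\s*(-?[0-9.]+)")


def silence_ranges(log: str) -> list:
    """Pair silence_start/silence_end markers into (start, end) tuples."""
    ranges = []
    start = None
    for line in log.splitlines():
        found = SILENCE_START.search(line)
        if found:
            start = float(found.group(1))
            continue
        found = SILENCE_END.search(line)
        if found and start is not None:
            ranges.append((start, float(found.group(1))))
            start = None
    return ranges


# Illustrative log lines.
sample_log = """\
[silencedetect @ 0x1] silence_start: 12.5
[silencedetect @ 0x1] silence_end: 15.2 | silence_duration: 2.7
[silencedetect @ 0x1] silence_start: 40
"""
gaps = silence_ranges(sample_log)
print(gaps)
```

A trailing `silence_start` with no matching end (silence running to the end of the file) is dropped by this sketch.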

Clip Extraction

# Re-encoded (accurate cuts)
ffmpeg -ss 00:01:30 -to 00:02:15 -i input.mp4 -c:v libx264 -c:a aac -preset fast -crf 23 -movflags +faststart -y clip.mp4

# Stream copy (fast and lossless, but cuts can only land on keyframes, so start and end may be slightly off)
ffmpeg -ss 00:01:30 -to 00:02:15 -i input.mp4 -c copy -y clip.mp4

-ss before -i = fast seek (recommended for extraction)
-to = end timestamp, -t = duration

Vertical Video (9:16 for Shorts/Reels/TikTok)

# Center crop (when source is 16:9)
ffmpeg -i input.mp4 -vf "crop=ih*9/16:ih:(iw-ih*9/16)/2:0,scale=1080:1920" -c:a copy output.mp4

# Scale with letterbox padding (preserves full frame)
ffmpeg -i input.mp4 -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" -c:a copy output.mp4
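To make the center-crop arithmetic concrete, this Python sketch computes the same crop rectangle the first command expresses with `ih*9/16` and `(iw-ih*9/16)/2`. The helper name `center_crop_9_16` is hypothetical, and rounding the width down to an even number is my assumption (a common precaution for yuv420p output), not something the command above requires.

```python
def center_crop_9_16(iw: int, ih: int) -> tuple:
    """Return (crop_w, crop_h, x, y) for a centered 9:16 crop of an iw x ih frame."""
    crop_w = ih * 9 // 16
    crop_w -= crop_w % 2  # assumption: keep the width even
    if crop_w > iw:
        raise ValueError("source is narrower than 9:16; pad instead of cropping")
    x = (iw - crop_w) // 2  # equal margins left and right
    return crop_w, ih, x, 0


w, h, x, y = center_crop_9_16(1920, 1080)
print(f"crop={w}:{h}:{x}:{y},scale=1080:1920")
```

For a 1920x1080 source only about a third of the frame width survives the crop, which is why the padded variant exists for content where the edges matter.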

Caption Burn-in

# SRT subtitles with styling (use relative path or forward-slash absolute path)
ffmpeg -i input.mp4 -vf "subtitles=subs.srt:force_style='FontSize=22,FontName=Arial,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,Outline=2,Alignment=2,MarginV=40'" -c:a copy output.mp4

# Simple text overlay
ffmpeg -i input.mp4 -vf "drawtext=text='Caption':fontsize=48:fontcolor=white:borderw=3:bordercolor=black:x=(w-text_w)/2:y=h-th-40" output.mp4

Windows path escaping: subtitles=C\\:/Users/me/subs.srt (double-backslash before colon)

Thumbnail Generation

# At specific time (2 seconds in)
ffmpeg -i input.mp4 -ss 2 -frames:v 1 -q:v 2 -y thumb.jpg

# First keyframe (I-frame), scaled to 1280x720
ffmpeg -i input.mp4 -vf "select='eq(pict_type,I)',scale=1280:720" -frames:v 1 thumb.jpg

# Contact sheet (one frame every 10 seconds in a 4x4 grid; -frames:v 1 writes a single sheet)
ffmpeg -i input.mp4 -vf "fps=1/10,scale=320:-1,tile=4x4" -frames:v 1 contact.jpg

Video Analysis

# Full metadata (JSON)
ffprobe -v quiet -print_format json -show_format -show_streams input.mp4

# Duration only
ffprobe -v error -show_entries format=duration -of csv=p=0 input.mp4

# Resolution
ffprobe -v error -select_streams v:0 -show_entries stream=width,height -of csv=p=0 input.mp4

API-Based STT Reference

Groq Whisper API

Fastest cloud STT — uses whisper-large-v3 on Groq hardware. Free tier available.

curl -s -X POST "https://api.groq.com/openai/v1/audio/transcriptions" \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word" \
  -o transcript_raw.json

Response: {"text": "...", "words": [{"word": "hello", "start": 0.0, "end": 0.32}]}

Max file size: 25MB. For longer audio, split with ffmpeg first.
timestamp_granularities[]=word is required for word-level timing.
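For the splitting step, a rough Python sketch of how long each chunk can be follows. It assumes the 16 kHz mono WAV produced earlier is 16-bit PCM with a 44-byte header and treats 25MB as 25,000,000 bytes to stay on the safe side; the names `max_chunk_seconds` and `chunk_ranges` are hypothetical.

```python
def max_chunk_seconds(limit_bytes: int = 25_000_000, sample_rate: int = 16000,
                      channels: int = 1, bytes_per_sample: int = 2,
                      header_bytes: int = 44) -> int:
    """Longest whole-second PCM WAV duration that fits under limit_bytes."""
    bytes_per_second = sample_rate * channels * bytes_per_sample
    return (limit_bytes - header_bytes) // bytes_per_second


def chunk_ranges(total_seconds: int, chunk_seconds: int) -> list:
    """Split [0, total_seconds) into consecutive (start, end) ranges."""
    return [
        (start, min(start + chunk_seconds, total_seconds))
        for start in range(0, total_seconds, chunk_seconds)
    ]


limit = max_chunk_seconds()
print(limit)
print(chunk_ranges(2000, limit))
```

Each range maps onto an `ffmpeg -ss START -to END` extraction. Timestamps returned for a chunk are relative to that chunk, so the chunk's start has to be added back before merging transcripts.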

OpenAI Whisper API

curl -s -X POST "https://api.openai.com/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav" \
  -F "model=whisper-1" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word" \
  -o transcript_raw.json

Response format same as Groq. Max 25MB.

Deepgram Nova-2

curl -s -X POST "https://api.deepgram.com/v1/listen?model=nova-2&smart_format=true&utterances=true&punctuate=true" \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav \
  -o transcript_raw.json

Response: {"results": {"channels": [{"alternatives": [{"words": [{"word": "hello", "start": 0.0, "end": 0.32, "confidence": 0.99}]}]}]}}

Supports streaming, but for clips use batch mode.
smart_format=true adds punctuation and casing.
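Since Deepgram nests its words more deeply than Groq or OpenAI do, a small adapter can bring them into one shape. The Python sketch below is illustrative: the name `deepgram_words` is hypothetical, it takes only the first alternative of each channel, and mapping `confidence` onto a `probability` key is my own convention for matching the Whisper JSON shown earlier.

```python
def deepgram_words(response: dict) -> list:
    """Flatten a Deepgram response into Whisper-style word dictionaries."""
    words = []
    for channel in response.get("results", {}).get("channels", []):
        alternatives = channel.get("alternatives", [])
        if not alternatives:
            continue
        for w in alternatives[0].get("words", []):  # first alternative only
            words.append({
                "word": w["word"],
                "start": w["start"],
                "end": w["end"],
                "probability": w.get("confidence"),
            })
    return words


sample = {"results": {"channels": [{"alternatives": [{"words": [
    {"word": "hello", "start": 0.0, "end": 0.32, "confidence": 0.99}
]}]}]}}
flat = deepgram_words(sample)
print(flat)
```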

TTS Reference

Edge TTS (free, no API key needed)

# List available voices
edge-tts --list-voices

# Generate speech
edge-tts --text "Your caption text here" --voice en-US-AriaNeural --write-media tts_output.mp3

# Other good voices: en-US-GuyNeural, en-GB-SoniaNeural, en-AU-NatashaNeural

Install: pip install edge-tts

OpenAI TTS

curl -s -X POST "https://api.openai.com/v1/audio/speech" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Your text here","voice":"alloy"}' \
  --output tts_output.mp3

Voices: alloy, echo, fable, onyx, nova, shimmer
Models: tts-1 (fast), tts-1-hd (quality)

ElevenLabs

curl -s -X POST "https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text":"Your text here","model_id":"eleven_monolingual_v1"}' \
  --output tts_output.mp3

Voice ID 21m00Tcm4TlvDq8ikWAM = Rachel (default). List voices: GET /v1/voices

Audio Merging (TTS + Original)

# Mix TTS over original audio (original at 30% volume, TTS at 100%)
# Note: amix rescales its inputs by default; on ffmpeg builds that support it, append :normalize=0 to keep these levels exact
ffmpeg -i clip.mp4 -i tts_output.mp3 \
  -filter_complex "[0:a]volume=0.3[orig];[1:a]volume=1.0[tts];[orig][tts]amix=inputs=2:duration=first[out]" \
  -map 0:v -map "[out]" -c:v copy -c:a aac -y clip_voiced.mp4

# Replace audio entirely (no original audio)
ffmpeg -i clip.mp4 -i tts_output.mp3 -map 0:v -map 1:a -c:v copy -c:a aac -shortest -y clip_voiced.mp4

Quality & Performance Tips

Use -preset ultrafast for quick previews, -preset slow for final output
Use -crf 23 for good quality (18=high, 28=low, lower=bigger files)
Add -movflags +faststart for web-friendly MP4
Use -threads 0 to auto-detect CPU cores
Always use -y to overwrite without asking

Telegram Bot API Reference

sendVideo — Upload and send a video to a chat/channel

curl -s -X POST "https://api.telegram.org/bot<BOT_TOKEN>/sendVideo" \
  -F "chat_id=<CHAT_ID>" \
  -F "video=@clip_N_final.mp4" \
  -F "caption=Clip title here" \
  -F "parse_mode=HTML" \
  -F "supports_streaming=true"

Parameters

Parameter	Required	Description
`chat_id`	Yes	Channel (`-100XXXXXXXXXX` or `@channelname`), group, or user numeric ID
`video`	Yes	`@filepath` for upload (max 50MB) or a Telegram `file_id` for re-send
`caption`	No	Text caption, up to 1024 characters
`parse_mode`	No	`HTML` or `MarkdownV2` for styled captions
`supports_streaming`	No	`true` enables progressive playback

Success Response

{"ok": true, "result": {"message_id": 1234, "video": {"file_id": "BAACAgI...", "file_size": 5242880}}}

Error Response

{"ok": false, "error_code": 400, "description": "Bad Request: chat not found"}

Common Errors

Error Code	Description	Fix
400	Chat not found	Verify chat_id; bot must be added to the channel/group
401	Unauthorized	Bot token is invalid or revoked — regenerate via @BotFather
413	Request entity too large	File exceeds 50MB — re-encode: `ffmpeg -i input.mp4 -fs 49M -c:v libx264 -crf 28 -preset fast -c:a aac -y output.mp4`
429	Too many requests	Rate limited — wait the `retry_after` seconds from the response
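The error table can be turned into a small decision function. This Python sketch only interprets an already-parsed response; the name `telegram_next_step` and its action labels are hypothetical, and it assumes `retry_after` arrives under the `parameters` object, which is where the Bot API places it.

```python
def telegram_next_step(response: dict) -> tuple:
    """Map a sendVideo response to (action, detail)."""
    if response.get("ok"):
        return ("done", response["result"]["message_id"])
    code = response.get("error_code")
    if code == 429:
        # Rate limited: wait the number of seconds Telegram asks for.
        wait = response.get("parameters", {}).get("retry_after", 1)
        return ("retry", wait)
    if code == 413:
        return ("reencode", None)  # file is over the upload limit
    # 400, 401 and anything else need a human to fix the configuration.
    return ("fail", response.get("description"))


print(telegram_next_step({"ok": True, "result": {"message_id": 1234}}))
print(telegram_next_step({"ok": False, "error_code": 429,
                          "description": "Too Many Requests: retry after 5",
                          "parameters": {"retry_after": 5}}))
```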

File Size Limit

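Telegram allows up to 50MB for video uploads via Bot API. If a clip exceeds this, re-encode it at lower quality. Note that -fs 49M does not target a size: it stops writing once the output reaches 49MB, so verify the result still contains the full clip: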

ffmpeg -i clip_N_final.mp4 -fs 49M -c:v libx264 -crf 28 -preset fast -c:a aac -movflags +faststart -y clip_N_tg.mp4
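Because -fs only stops writing at the limit, an alternative is to pick a bitrate that fits the whole clip under the cap. The Python sketch below estimates one from the clip duration (for example from the ffprobe duration command). It is a rough calculation under stated assumptions: 1MB is taken as 1,000,000 bytes, audio is assumed to be 128 kbps, and 5% is held back for container overhead. The name `target_video_kbps` is hypothetical.

```python
def target_video_kbps(duration_s: float, limit_mb: float,
                      audio_kbps: int = 128, safety: float = 0.95) -> int:
    """Estimate a video bitrate (kbps) that keeps the file under limit_mb."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    # 1 MB = 8000 kilobits; keep a margin for container overhead.
    total_kbps = limit_mb * 8000 * safety / duration_s
    video_kbps = int(total_kbps) - audio_kbps
    if video_kbps <= 0:
        raise ValueError("clip is too long to fit at this audio bitrate")
    return video_kbps


print(target_video_kbps(60, 50))   # one-minute clip, Telegram-sized cap
print(target_video_kbps(300, 16))  # five-minute clip, 16MB cap
```

The result would be passed as `-b:v <value>k` in place of `-crf`, with `-b:a 128k` for the audio.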

WhatsApp Business Cloud API Reference

Two-Step Flow: Upload Media → Send Message

WhatsApp Cloud API requires uploading the video first to get a media_id, then sending a message referencing that ID.

Step 1 — Upload Media

curl -s -X POST "https://graph.facebook.com/v21.0/<PHONE_NUMBER_ID>/media" \
  -H "Authorization: Bearer <ACCESS_TOKEN>" \
  -F "file=@clip_N_final.mp4" \
  -F "type=video/mp4" \
  -F "messaging_product=whatsapp"

Success response:

{"id": "1234567890"}

Step 2 — Send Video Message

curl -s -X POST "https://graph.facebook.com/v21.0/<PHONE_NUMBER_ID>/messages" \
  -H "Authorization: Bearer <ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "messaging_product": "whatsapp",
    "to": "<RECIPIENT_PHONE>",
    "type": "video",
    "video": {
      "id": "<MEDIA_ID>",
      "caption": "Clip title here"
    }
  }'

Success response:

{"messaging_product": "whatsapp", "contacts": [{"wa_id": "14155551234"}], "messages": [{"id": "wamid.HBgL..."}]}
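The Step 2 request body can be assembled in code rather than hand-written JSON. This Python sketch builds the same payload and normalizes the recipient to digits only (no + prefix, no spaces); the name `build_video_message` is hypothetical, and stripping non-digits is my assumption about how to satisfy that format.

```python
import json


def build_video_message(to: str, media_id: str, caption: str = "") -> dict:
    """Build the JSON body for sending an uploaded video by media ID."""
    recipient = "".join(ch for ch in to if ch.isdigit())  # drop '+', spaces, dashes
    if not recipient:
        raise ValueError("recipient phone number has no digits")
    video = {"id": media_id}
    if caption:
        video["caption"] = caption
    return {
        "messaging_product": "whatsapp",
        "to": recipient,
        "type": "video",
        "video": video,
    }


payload = build_video_message("+1 415 555 1234", "1234567890", "Clip title here")
print(json.dumps(payload, indent=2))
```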

File Size Limit

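WhatsApp allows up to 16MB for video uploads. If a clip exceeds this, re-encode it at lower quality. Note that -fs 15M does not target a size: it stops writing once the output reaches 15MB, so verify the result still contains the full clip: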

ffmpeg -i clip_N_final.mp4 -fs 15M -c:v libx264 -crf 30 -preset fast -c:a aac -movflags +faststart -y clip_N_wa.mp4

24-Hour Messaging Window

WhatsApp requires the recipient to have messaged you within the last 24 hours (for non-template messages). If you get a "template required" error, either:

Ask the recipient to send any message to the business number first
Use a pre-approved message template instead of a free-form video message

Common Errors

Error Code	Description	Fix
100	Invalid parameter	Check phone_number_id and recipient format (no + prefix, no spaces)
190	Invalid/expired access token	Regenerate token in Meta Business Settings; temporary tokens expire in 24h
131030	Recipient not in allowed list	In test mode, add recipient to allowed numbers in Meta Developer Portal
131047	Re-engagement message / template required	Recipient hasn't messaged within 24h — use a template or ask them to message first
131053	Media upload failed	File too large or unsupported format — re-encode as MP4 under 16MB
