video-reader
Video Reader Skill
Extract key frames to "see" videos, and extract + transcribe audio to "hear" them.
Quick Start
# Get video info (duration, resolution, codec)
ffprobe -v error -show_entries format=duration:stream=codec_name,width,height -of json "$VIDEO_PATH"
# Extract key frames (1 per second, max 12 frames)
OUTDIR=/tmp/alma-frames-$(date +%s)
mkdir -p "$OUTDIR"
DURATION=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$VIDEO_PATH" | cut -d. -f1)
FPS_RATE=$(echo "scale=2; 12 / $DURATION" | bc 2>/dev/null || echo "1")
# Cap at 1fps for short videos
if (( $(echo "$FPS_RATE > 1" | bc -l 2>/dev/null || echo 0) )); then FPS_RATE=1; fi
ffmpeg -hide_banner -loglevel error -i "$VIDEO_PATH" -vf "fps=$FPS_RATE,scale=720:-1" -frames:v 12 "$OUTDIR/frame_%02d.jpg"
ls "$OUTDIR"
How to Use
- Get video info first to know the duration
- For short videos (<15s): extract 1 frame per second
- For medium videos (15-60s): extract ~8-12 frames evenly spread
- For long videos (>60s): extract 12 frames at key intervals
- Look at the extracted frames (they're image files) to describe the video content
Frame Extraction Patterns
# Even spread: N frames across entire video
ffmpeg -hide_banner -loglevel error -i "$VIDEO_PATH" \
-vf "select='not(mod(n\,$(ffprobe -v error -count_frames -select_streams v:0 -show_entries stream=nb_read_frames -of csv=p=0 "$VIDEO_PATH" | awk -v n=8 '{printf "%d", $1/n}')))'" \
-vsync vfr -frames:v 8 "$OUTDIR/frame_%02d.jpg"
# Specific timestamp
ffmpeg -hide_banner -loglevel error -ss 5.0 -i "$VIDEO_PATH" -frames:v 1 "$OUTDIR/at_5s.jpg"
# Thumbnail grid (single image overview)
ffmpeg -hide_banner -loglevel error -i "$VIDEO_PATH" \
-vf "fps=1/$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$VIDEO_PATH" | awk '{printf "%.1f", $1/9}'),scale=320:-1,tile=3x3" \
-frames:v 1 "$OUTDIR/grid.jpg"
Tips
- Always use
-hide_banner -loglevel errorto keep output clean - Scale down to 720px width (
scale=720:-1) to save tokens when sending to AI - Clean up frames after analysis:
rm -rf "$OUTDIR" - The extracted frames are regular image files — include their paths in your reply and they'll be auto-sent to Telegram
Audio: "Hearing" Videos
# Extract audio from video
ffmpeg -hide_banner -loglevel error -i "$VIDEO_PATH" -vn -acodec pcm_s16le -ar 16000 -ac 1 "/tmp/alma-audio-$(date +%s).wav"
# Transcribe with Whisper (auto-detect language)
whisper "/tmp/alma-audio.wav" --model turbo --output_format txt --output_dir /tmp/alma-whisper
# Transcribe with specific language
whisper "/tmp/alma-audio.wav" --model turbo --language zh --output_format txt --output_dir /tmp/alma-whisper
# Read transcription
cat /tmp/alma-whisper/*.txt
When to See vs Hear
- "这个视频里有啥" → Extract frames (see) + transcribe audio (hear) for full picture
- "他说了什么" → Transcribe audio only
- "这个视频好看吗" → Extract frames to see the visuals
- "好听" → The user is commenting on audio content, transcribe to understand
- Music/street performance → Mention what you see in frames + note the audio content
- When in doubt, do BOTH — extract a few frames AND transcribe the audio
Cleanup
Always clean up temp files after analysis:
rm -rf "$OUTDIR" /tmp/alma-whisper /tmp/alma-audio*.wav
More from naohainezha/skill
reactions
React to the user's Telegram message with an emoji. Use when the message evokes a genuine emotional response.
28self-reflection
Daily self-reflection and personal growth. Triggered by heartbeat at end of day. Review the day's experiences, extract lessons, update personality, and write a diary entry.
6voice
Send voice messages (TTS) to the user via Telegram. Use when replying to voice messages or when a voice reply feels natural.
3scheduler
Create, manage, and delete scheduled tasks (cron jobs) and configure heartbeat. Use when users ask for reminders, recurring tasks, daily summaries, periodic checks, or anything time-based. Also manages HEARTBEAT.md for periodic awareness checks.
3thread-management
Manage chat threads — create, list, switch, delete, and search conversations. Use when users want to organize their chats.
3memory-management
Search and manage Alma's memory and conversation history. Use when the user asks about past conversations, personal facts, preferences, or anything that requires recalling information ("你知道我...吗", "我们之前聊过...", "你还记得...", "帮我找之前说的..."). Also used to store new memories and search through archived chat threads.
3