media-understand

Media Understanding

Analyze and understand multimedia content using Gemini 2.5 Flash.

Supported Formats

Type  | Formats                               | Max Size
Image | jpg, jpeg, png, gif, webp             | 20MB
Video | mp4, mpeg, mov, webm, YouTube URL     | 100MB
Audio | wav, mp3, aiff, aac, ogg, flac, m4a   | 100MB
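
The limits in the table can be checked locally before a file is submitted. The following is a minimal sketch, not the skill's own validation logic: `LIMITS` and `classifyMedia` are hypothetical names, and 1MB is assumed to mean 1024 × 1024 bytes, which the source does not specify. YouTube URLs are not handled here.

```javascript
// Hypothetical pre-flight check built from the Supported Formats table.
// Assumption: 1MB = 1024 * 1024 bytes (the source does not say).
const MB = 1024 * 1024;
const LIMITS = {
  image: { exts: ["jpg", "jpeg", "png", "gif", "webp"], maxBytes: 20 * MB },
  video: { exts: ["mp4", "mpeg", "mov", "webm"], maxBytes: 100 * MB },
  audio: { exts: ["wav", "mp3", "aiff", "aac", "ogg", "flac", "m4a"], maxBytes: 100 * MB },
};

// Returns "image", "video", or "audio"; throws on unsupported or oversized files.
function classifyMedia(path, sizeBytes) {
  // Look at the file name only, so "./photo" is not read as extension "/photo".
  const name = path.split("/").pop();
  const dot = name.lastIndexOf(".");
  const ext = dot === -1 ? "" : name.slice(dot + 1).toLowerCase();
  for (const [type, { exts, maxBytes }] of Object.entries(LIMITS)) {
    if (!exts.includes(ext)) continue;
    if (sizeBytes > maxBytes) {
      throw new Error(`File too large: ${type} limit is ${maxBytes / MB}MB`);
    }
    return type;
  }
  throw new Error(`Unsupported format: "${ext}"`);
}
```

The file size would come from the caller, for example from a file-stat call, so the function itself stays free of I/O.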

Prerequisites

  1. MAX_API_KEY environment variable (injected automatically by Max)
  2. Bun 1.0+ (bundled with Max v0.0.27+, no separate installation needed)

Usage

bun skills/media-understand/media-understand.js <media_path_or_url> [prompt] [language]

Arguments:

  • media_path_or_url: File path or YouTube URL
  • prompt: Question or analysis request (default: "Please describe this content")
  • language: Output language - chinese or english (default: chinese)
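
To make the defaults concrete, here is a minimal sketch of how the three positional arguments could be resolved. It is an illustration only: `resolveArgs` is a hypothetical name, and rejecting languages other than chinese and english is an assumption; the actual script may behave differently.

```javascript
// Hypothetical sketch of resolving the documented positional arguments.
// The defaults come from the Arguments list; the real script may differ.
function resolveArgs(args) {
  const [media, prompt, language] = args;
  if (!media) {
    throw new Error("Usage: media-understand.js <media_path_or_url> [prompt] [language]");
  }
  const lang = language ?? "chinese";
  // Assumption: only the two documented languages are accepted.
  if (lang !== "chinese" && lang !== "english") {
    throw new Error(`Unsupported language: ${lang}`);
  }
  return {
    media,
    prompt: prompt ?? "Please describe this content",
    language: lang,
  };
}
```

Inside a script run by Bun, the input would typically be `process.argv.slice(2)`.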

Examples

Image Analysis

# Describe image
bun skills/media-understand/media-understand.js ./photo.jpg "请描述这张图片" chinese

# OCR - Extract text
bun skills/media-understand/media-understand.js ./screenshot.png "识别图片中的所有文字" chinese

# Answer question about image
bun skills/media-understand/media-understand.js ./chart.png "这个图表显示了什么趋势?" chinese

Video Analysis

# YouTube video summary
bun skills/media-understand/media-understand.js "https://youtube.com/watch?v=xxx" "总结这个视频的主要内容" chinese

# Local video analysis
bun skills/media-understand/media-understand.js ./video.mp4 "视频中发生了什么?" chinese

# Timestamp-based question
bun skills/media-understand/media-understand.js "https://youtu.be/xxx" "视频 2:30 处讲了什么?" chinese

Audio Analysis

# Transcribe audio
bun skills/media-understand/media-understand.js ./recording.mp3 "请转录这段音频" chinese

# Summarize podcast
bun skills/media-understand/media-understand.js ./podcast.m4a "总结这段播客的要点" chinese

# Detect speakers
bun skills/media-understand/media-understand.js ./meeting.wav "识别不同的说话人并整理他们说的内容" chinese

Common Prompts

Image:

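  • Describe image: "请详细描述这张图片的内容" (Describe the content of this image in detail)
  • OCR: "识别并提取图片中的所有文字" (Recognize and extract all text in the image)
  • Object recognition: "图片中有哪些物体?" (What objects are in the image?)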

Video:

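  • Summary: "总结这个视频的主要内容" (Summarize the main content of this video)
  • Timestamp: "视频 X:XX 处发生了什么?" (What happens at X:XX in the video?)
  • Extract information: "视频中提到了哪些关键信息?" (What key information is mentioned in the video?)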

Audio:

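  • Transcription: "请转录这段音频的完整内容" (Transcribe the full content of this audio)
  • Summary: "总结这段音频的要点" (Summarize the key points of this audio)
  • Speaker identification: "识别不同的说话人" (Identify the different speakers)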
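
When the skill is invoked from another script, a few of the prompts listed above can be kept in a lookup table and turned into an argument array. This is a rough sketch under stated assumptions: `PROMPTS`, `SCRIPT`, and `buildArgv` are hypothetical names, the task keys are invented for illustration, and only the fixed prompts are covered (the timestamp prompt needs a time filled in).

```javascript
// Hypothetical lookup of prompts copied from the Common Prompts list.
// Task keys ("describe", "ocr", ...) are illustrative, not part of the skill.
const PROMPTS = {
  image: {
    describe: "请详细描述这张图片的内容",
    ocr: "识别并提取图片中的所有文字",
  },
  video: {
    summary: "总结这个视频的主要内容",
  },
  audio: {
    transcribe: "请转录这段音频的完整内容",
    summary: "总结这段音频的要点",
  },
};

// Script path as shown in the Usage section.
const SCRIPT = "skills/media-understand/media-understand.js";

// Returns the command as an argv array, so no shell quoting is required.
function buildArgv(mediaPath, type, task, language = "chinese") {
  const prompt = PROMPTS[type]?.[task];
  if (!prompt) {
    throw new Error(`No prompt for ${type}/${task}`);
  }
  return ["bun", SCRIPT, mediaPath, prompt, language];
}
```

Returning an array instead of a single string avoids quoting problems with spaces or punctuation in prompts; the array can be handed to a process-spawning API such as `Bun.spawn`.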

Notes

  • Video via Gemini: Best results with YouTube URLs. Local video files may have limited support.
  • Audio tokens: ~32 tokens/second
  • Video tokens: ~300 tokens/second at default resolution
  • Long media files will consume more tokens
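
The per-second rates above allow a rough estimate of media token usage before a file is submitted. The sketch below simply multiplies the quoted rate by the duration; `TOKENS_PER_SECOND` and `estimateTokens` are hypothetical names, the rates are the approximations from these notes, and prompt and response tokens are not included.

```javascript
// Approximate rates quoted in the Notes; actual usage may vary.
const TOKENS_PER_SECOND = { audio: 32, video: 300 };

// Rough media-token estimate: rate (tokens/second) * duration (seconds).
function estimateTokens(type, durationSeconds) {
  const rate = TOKENS_PER_SECOND[type];
  if (rate === undefined) {
    throw new Error(`No token rate known for type: ${type}`);
  }
  return Math.round(rate * durationSeconds);
}

// A 10-minute file is 600 seconds:
console.log(estimateTokens("audio", 600)); // 19200
console.log(estimateTokens("video", 600)); // 180000
```

At these rates, ten minutes of video costs roughly nine times as many tokens as ten minutes of audio, which is worth keeping in mind when trimming long recordings.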

Error Handling

File not found: Check the file path is correct

Unsupported format: Use supported formats listed above

File too large: Compress or trim the media file

API error: Check in the Max settings that the Max API Key is configured correctly
