google-gemini-media

SKILL.md

Gemini Multimodal Media (Image/Video/Speech) Skill

1. Goals and scope

This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:

  • Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
  • Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API)
  • Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
  • Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
  • Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
  • Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)

Convention: This Skill treats the official Google Gen AI SDK (Node.js/REST) as the primary path and currently provides only Node.js/REST examples. If your project already wraps the API in another language or framework, map this Skill's request structure, model selection, and I/O spec onto your wrapper layer.
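
When mapping the six capabilities onto a wrapper layer, one option is to describe them as a small lookup table first. The sketch below is only an illustration under stated assumptions: the `Capability` type, the `CAPABILITIES` table, the identifier strings, and `findCapability` are hypothetical names invented here, not part of the Google Gen AI SDK, and the input modes are copied from the scope list above rather than from the API reference.

```typescript
// Hypothetical capability table for a wrapper layer.
// None of these names come from the Google Gen AI SDK; they are illustrative only.
type Direction = "generate" | "understand";
type Media = "image" | "video" | "audio";

interface Capability {
  id: string;
  media: Media;
  direction: Direction;
  // Input modes that the scope list names explicitly; empty where it names none.
  inputModes: string[];
}

const CAPABILITIES: Capability[] = [
  { id: "image-generation", media: "image", direction: "generate", inputModes: [] },
  { id: "image-understanding", media: "image", direction: "understand", inputModes: ["inline", "files-api"] },
  { id: "video-generation", media: "video", direction: "generate", inputModes: [] },
  { id: "video-understanding", media: "video", direction: "understand", inputModes: ["upload", "inline", "youtube-url"] },
  { id: "speech-generation", media: "audio", direction: "generate", inputModes: [] },
  { id: "audio-understanding", media: "audio", direction: "understand", inputModes: ["upload", "inline"] },
];

// Look up the single capability that matches a media type and direction.
function findCapability(media: Media, direction: Direction): Capability {
  const match = CAPABILITIES.find(
    (c) => c.media === media && c.direction === direction
  );
  if (!match) {
    throw new Error(`No capability for ${direction} ${media}`);
  }
  return match;
}
```

A wrapper could call `findCapability("video", "understand")` to check which input modes it needs to support before building a request. The actual request structure and model selection would still come from the Skill's per-capability sections.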


2. Quick routing (decide which capability to use)

First seen: Mar 29, 2026