Gemini Multimodal Media (Image/Video/Speech) Skill

1. Goals and scope

This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:

Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports both inline data and the Files API)
Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)

Convention: This Skill uses the official Google Gen AI SDK for Node.js and the REST API as its primary path; only Node.js and REST examples are currently provided. If your project wraps the API in another language or framework, map this Skill's request structure, model selection, and I/O spec onto your wrapper layer.
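To make the "request structure" concrete for a wrapper layer, here is a minimal sketch of a helper that assembles a REST-style `generateContent` request body from text plus optional inline media. The helper name `buildGenerateContentBody` and its argument shape are hypothetical (they are not part of the SDK); the body layout (`contents` > `parts` > `text` / `inline_data`) is assumed to follow the public REST format, so check it against the current API reference before relying on it.

```javascript
// Hypothetical helper (not part of the Google Gen AI SDK): builds a
// REST-style generateContent body that a wrapper layer can send as JSON.
// Assumption: the REST format is { contents: [{ parts: [...] }] } with
// parts shaped as { text } or { inline_data: { mime_type, data } }.
function buildGenerateContentBody({ text, inlineMedia = [] }) {
  // Media parts go first, followed by the text instruction.
  const parts = inlineMedia.map(({ mimeType, base64 }) => ({
    inline_data: { mime_type: mimeType, data: base64 },
  }));
  if (text) {
    parts.push({ text });
  }
  if (parts.length === 0) {
    throw new Error("Provide a text prompt, inline media, or both.");
  }
  return { contents: [{ parts }] };
}

// Usage: one placeholder image plus a caption request.
const exampleBody = buildGenerateContentBody({
  text: "Caption this image.",
  inlineMedia: [
    // Placeholder bytes only; a real call would base64-encode an image file.
    { mimeType: "image/png", base64: Buffer.from("placeholder").toString("base64") },
  ],
});
console.log(JSON.stringify(exampleBody, null, 2));
```

Keeping body construction in one plain function means a wrapper in another language only has to reproduce this single JSON shape; model selection and transport stay outside the helper.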

2. Quick routing (decide which capability to use)