multimodal-llm
Multimodal LLM Patterns
Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling v3, Sora 2, Veo 3.1 in its standard, lite, and fast tiers, and Runway Gen-4.5, which the API exposes as gen4_turbo).
Canonical model IDs (pinned against yonatan-hq/platform/apps/api/app/config.py):

| Provider | Model IDs |
| --- | --- |
| Anthropic | `claude-opus-4-8` (latest), `claude-opus-4-7`, `claude-opus-4-6`, `claude-sonnet-4-6`, `claude-haiku-4-5-20251001` |
| OpenAI | `gpt-5.5` (current flagship) |
| Google | `gemini-3.1-pro-preview` (flagship), `gemini-3.1-flash-lite-preview` (cost) |
| Veo | `veo-3.1-generate-preview` / `veo-3.1-lite-generate-preview` / `veo-3.1-fast-generate-preview` |
| Kling | `kling-v3` (`model_name` field in Kling API) |
| Runway | `gen4_turbo` (product label: Gen-4.5) |
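
One way to keep these IDs pinned in application code is a small lookup table with a validation helper, so that a typo or a retired ID fails fast instead of reaching a provider API. The sketch below is illustrative only: the names `MODEL_IDS`, `default_model`, and `validate_model`, the lowercase provider keys (including `google` for the Gemini IDs), and the "first entry is the default" convention are assumptions, not the contents of config.py.

```python
# Hypothetical registry; the ID strings are copied from the list above,
# everything else (names, keys, ordering convention) is an assumption.
MODEL_IDS: dict[str, tuple[str, ...]] = {
    "anthropic": (
        "claude-opus-4-8",
        "claude-opus-4-7",
        "claude-opus-4-6",
        "claude-sonnet-4-6",
        "claude-haiku-4-5-20251001",
    ),
    "openai": ("gpt-5.5",),
    "google": ("gemini-3.1-pro-preview", "gemini-3.1-flash-lite-preview"),
    "veo": (
        "veo-3.1-generate-preview",
        "veo-3.1-lite-generate-preview",
        "veo-3.1-fast-generate-preview",
    ),
    "kling": ("kling-v3",),
    "runway": ("gen4_turbo",),
}


def default_model(provider: str) -> str:
    """Return the first listed ID for a provider (assumed to be the default)."""
    return MODEL_IDS[provider.lower()][0]


def validate_model(provider: str, model_id: str) -> str:
    """Return model_id unchanged if it is pinned for the provider, else raise."""
    allowed = MODEL_IDS.get(provider.lower(), ())
    if model_id not in allowed:
        raise ValueError(f"{model_id!r} is not a pinned model ID for {provider!r}")
    return model_id
```

Calling `validate_model("runway", "gen4_turbo")` returns the ID unchanged, while an unpinned ID raises `ValueError` before any request is made.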