ai-multimodal
Originally fromsamhvw8/dot-claude
Installation
SKILL.md
AI Multimodal Processing Skill
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
Core Capabilities
Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice
Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
Related skills
More from jackspace/claudeskillz
base-ui-react
|
202rapid-prototyper
Creates minimal working prototypes for quick idea validation. Single-file when possible, includes test data, ready to demo immediately. Use when user says "prototype", "MVP", "proof of concept", "quick demo".
180repository-analyzer
Analyzes codebases to generate comprehensive documentation including structure, languages, frameworks, dependencies, design patterns, and technical debt. Use when user says "analyze repository", "understand codebase", "document project", or when exploring unfamiliar code.
177windows-expert
Expert guidance for Windows, PowerShell, WSL interop, and cross-platform development
120hugo
|
119firecrawl-scraper
|
100