# Multimodal Models

Pre-trained models for vision, audio, and cross-modal tasks.

## Model Overview
| Model | Modality | Task |
|---|---|---|
| CLIP | Image + Text | Zero-shot classification, similarity |
| Whisper | Audio → Text | Transcription, translation |
| Stable Diffusion | Text → Image | Image generation, editing |
## CLIP (Vision-Language)

Zero-shot image classification without training on specific labels.
### CLIP Use Cases
| Task | How |
|---|---|
| Zero-shot classification | Compare image to text label embeddings |
| Image search | Find images matching text query |
| Content moderation | Classify against safety categories |
| Image similarity | Compare image embeddings |
### CLIP Models
| Model | Parameters | Trade-off |
|---|---|---|
| ViT-B/32 | 151M | Recommended balance |
| ViT-L/14 | 428M | Best quality, slower |
| RN50 | 102M | Fastest, lower quality |
### CLIP Concepts
| Concept | Description |
|---|---|
| Dual encoder | Separate encoders for image and text |
| Contrastive learning | Trained to match image-text pairs |
| Normalization | Always normalize embeddings before similarity |
| Descriptive labels | Better labels = better zero-shot accuracy |
Key concept: CLIP embeds images and text in the same vector space, so classification reduces to finding the nearest text-label embedding.
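
The nearest-text-embedding step can be sketched in plain Python. This is a minimal illustration of the normalize-then-dot-product math, not the CLIP API itself; the comment at the bottom shows where real embeddings would come from (file and label strings there are placeholders).

```python
from math import sqrt

def _normalize(v):
    """L2-normalize a vector so dot products become cosine similarities."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_classify(image_emb, text_embs):
    """Return the index of the text-label embedding closest to the image.

    Both sides are normalized first -- the "always normalize embeddings
    before similarity" rule from the concepts table above.
    """
    img = _normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, _normalize(t))) for t in text_embs]
    return sims.index(max(sims))

# With the real model, embeddings would come from something like:
#   import clip, torch
#   model, preprocess = clip.load("ViT-B/32")
#   image_emb = model.encode_image(preprocess(img).unsqueeze(0))
#   text_embs = model.encode_text(clip.tokenize(["a photo of a cat", "a photo of a dog"]))
```

More descriptive labels ("a photo of a cat" rather than "cat") typically improve zero-shot accuracy, per the concepts table.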
### CLIP Limitations

- Weak at fine-grained classification (e.g. distinguishing similar species or product models)
- No spatial understanding (whole image only)
- May reflect training data biases
## Whisper (Speech Recognition)

Robust multilingual transcription supporting 99 languages.
### Whisper Use Cases
| Task | Configuration |
|---|---|
| Transcription | Default transcribe task |
| Translation to English | task="translate" |
| Subtitles | Output format SRT/VTT |
| Word timestamps | word_timestamps=True |
### Whisper Models
| Model | Size | Speed | Recommendation |
|---|---|---|---|
| turbo | 809M | Fast | Recommended |
| large | 1550M | Slow | Maximum quality |
| small | 244M | Medium | Good balance |
| base | 74M | Fast | Quick tests |
| tiny | 39M | Fastest | Prototyping only |
### Whisper Concepts
| Concept | Description |
|---|---|
| Language detection | Auto-detects, or specify for speed |
| Initial prompt | Improves technical terms accuracy |
| Timestamps | Segment-level or word-level |
| faster-whisper | 4× faster alternative implementation |
Key concept: Specify the language when it is known; auto-detection adds latency.
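
A sketch of the use cases above, assuming the `openai-whisper` package: a small SRT-timestamp helper plus a guarded transcription call (the audio filename and the `initial_prompt` vocabulary are placeholders; the guarded part downloads model weights on first run).

```python
def srt_timestamp(seconds: float) -> str:
    """Format a segment time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

if __name__ == "__main__":
    import whisper  # pip install openai-whisper

    model = whisper.load_model("turbo")
    # Specifying language skips auto-detection; initial_prompt biases
    # decoding toward domain vocabulary (placeholder terms here).
    result = model.transcribe("audio.mp3", language="en",
                              initial_prompt="Kubernetes, etcd, kubelet")
    for seg in result["segments"]:
        print(srt_timestamp(seg["start"]), "-->", srt_timestamp(seg["end"]))
        print(seg["text"].strip())
```

Passing `word_timestamps=True` to `transcribe` yields word-level timings inside each segment, as the use-case table notes.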
### Whisper Limitations
- May hallucinate on silence/noise
- No speaker diarization (who said what)
- Accuracy degrades on >30 min audio
- Not suitable for real-time captioning
## Stable Diffusion (Image Generation)

Text-to-image generation with various control methods.
### SD Use Cases
| Task | Pipeline |
|---|---|
| Text-to-image | DiffusionPipeline |
| Style transfer | Image2Image |
| Fill regions | Inpainting |
| Guided generation | ControlNet |
| Custom styles | LoRA adapters |
### SD Models
| Model | Resolution | Quality |
|---|---|---|
| SDXL | 1024×1024 | Best |
| SD 1.5 | 512×512 | Good, faster |
| SD 2.1 | 768×768 | Middle ground |
### Key Parameters
| Parameter | Effect | Typical Value |
|---|---|---|
| num_inference_steps | Quality vs speed | 20-50 |
| guidance_scale | Prompt adherence | 7-12 |
| negative_prompt | Avoid artifacts | "blurry, low quality" |
| strength (img2img) | How much to change | 0.5-0.8 |
| seed | Reproducibility | Fixed number |
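
The parameters above map directly onto a diffusers call. A hedged sketch: the checkpoint id, prompt, and filename are illustrative, and the guarded section needs a GPU plus a weight download. The small helper just bundles the table's typical values so they are visible in one place.

```python
def generation_kwargs(steps: int = 30, guidance: float = 7.5,
                      negative: str = "blurry, low quality") -> dict:
    """Bundle the key parameters from the table above into pipeline kwargs."""
    assert 20 <= steps <= 50, "typical range from the table"
    return {"num_inference_steps": steps,
            "guidance_scale": guidance,
            "negative_prompt": negative}

if __name__ == "__main__":
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # any SD 1.5 checkpoint id works
        torch_dtype=torch.float16,
    ).to("cuda")
    # A fixed seed makes the same prompt + parameters reproducible.
    gen = torch.Generator("cuda").manual_seed(42)
    image = pipe("a watercolor fox in a forest",
                 generator=gen, **generation_kwargs()).images[0]
    image.save("fox.png")
```

For img2img or inpainting, the corresponding `StableDiffusionImg2ImgPipeline` / inpaint pipelines accept the same kwargs plus `strength` and a source image.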
### Control Methods
| Method | Input | Use Case |
|---|---|---|
| ControlNet | Edge/depth/pose | Structural guidance |
| LoRA | Trained weights | Custom styles |
| Img2Img | Source image | Style transfer |
| Inpainting | Image + mask | Fill regions |
### Memory Optimization
| Technique | Effect |
|---|---|
| CPU offload | Reduces VRAM usage |
| Attention slicing | Trades speed for memory |
| VAE tiling | Large image support |
| xFormers | Faster attention |
| DPM scheduler | Fewer steps needed |
Key concept: Use SDXL for quality, SD 1.5 for speed. Always use negative prompts.
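
The memory techniques above can be applied as a short setup routine. A sketch assuming a diffusers pipeline object (the `enable_*` calls are standard diffusers pipeline methods; the checkpoint id and the 8 GB threshold are illustrative, and the guarded part downloads weights):

```python
def apply_memory_savers(pipe, vram_gb: float) -> None:
    """Apply the table's VRAM savers; offload only when memory is tight.

    `pipe` is any object exposing the diffusers-style enable_* methods.
    """
    pipe.enable_attention_slicing()      # trades some speed for memory
    pipe.enable_vae_tiling()             # lets the VAE handle large images
    if vram_gb < 8:                      # threshold is a rough assumption
        pipe.enable_model_cpu_offload()  # keep idle submodules on the CPU

if __name__ == "__main__":
    import torch
    from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
    # DPM-Solver++ reaches comparable quality in fewer steps than the default.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    apply_memory_savers(pipe, vram_gb=6.0)
```

xFormers attention (`pipe.enable_xformers_memory_efficient_attention()`) is an optional extra when the xformers package is installed.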
### SD Limitations
- GPU strongly recommended (CPU very slow)
- Large VRAM requirements for SDXL
- May generate anatomical errors
- Prompt engineering matters
## Common Patterns

### Embedding and Similarity
All three models use embeddings:
- CLIP: Image/text embeddings for similarity
- Whisper: Audio embeddings for transcription
- SD: Text embeddings for image conditioning
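
Because embeddings are expensive to recompute (see Best Practices below), caching them pays off. A minimal memoizing wrapper, where `embed_fn` is whatever encoder you wrap (e.g. a function around CLIP's `encode_image`; the function and key names are illustrative):

```python
class EmbeddingCache:
    """Memoize expensive embedding calls (e.g. CLIP encodings) by key."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # e.g. wraps model.encode_image
        self._store = {}

    def get(self, key):
        if key not in self._store:          # compute each embedding once
            self._store[key] = self._embed_fn(key)
        return self._store[key]
```

For large collections, the same idea extends to persisting embeddings on disk or in a vector store rather than an in-memory dict.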
### GPU Acceleration
| Model | VRAM Needed |
|---|---|
| CLIP ViT-B/32 | ~2 GB |
| Whisper turbo | ~6 GB |
| SD 1.5 | ~6 GB |
| SDXL | ~10 GB |
## Best Practices
| Practice | Why |
|---|---|
| Use recommended model sizes | Best quality/speed balance |
| Cache embeddings (CLIP) | Expensive to recompute |
| Specify language (Whisper) | Faster than auto-detect |
| Use negative prompts (SD) | Avoid common artifacts |
| Set seeds for reproducibility | Consistent results |
## Resources
- CLIP: https://github.com/openai/CLIP
- Whisper: https://github.com/openai/whisper
- Diffusers: https://huggingface.co/docs/diffusers