Multimodal Models
Pre-trained models for vision, audio, and cross-modal tasks.
Model Overview
| Model |
Modality |
Task |
| CLIP |
Image + Text |
Zero-shot classification, similarity |
| Whisper |
Audio → Text |
Transcription, translation |
| Stable Diffusion |
Text → Image |
Image generation, editing |
CLIP (Vision-Language)
Zero-shot image classification without training on specific labels.
CLIP Use Cases
| Task |
How |
| Zero-shot classification |
Compare image to text label embeddings |
| Image search |
Find images matching text query |
| Content moderation |
Classify against safety categories |
| Image similarity |
Compare image embeddings |
CLIP Models
| Model |
Parameters |
Trade-off |
| ViT-B/32 |
151M |
Recommended balance |
| ViT-L/14 |
428M |
Best quality, slower |
| RN50 |
102M |
Fastest, lower quality |
CLIP Concepts
| Concept |
Description |
| Dual encoder |
Separate encoders for image and text |
| Contrastive learning |
Trained to match image-text pairs |
| Normalization |
Always normalize embeddings before similarity |
| Descriptive labels |
Better labels = better zero-shot accuracy |
Key concept: CLIP embeds images and text in same space. Classification = find nearest text embedding.
CLIP Limitations
- Not for fine-grained classification
- No spatial understanding (whole image only)
- May reflect training data biases
Whisper (Speech Recognition)
Robust multilingual transcription supporting 99 languages.
Whisper Use Cases
| Task |
Configuration |
| Transcription |
Default transcribe task |
| Translation to English |
task="translate" |
| Subtitles |
Output format SRT/VTT |
| Word timestamps |
word_timestamps=True |
Whisper Models
| Model |
Size |
Speed |
Recommendation |
| turbo |
809M |
Fast |
Recommended |
| large |
1550M |
Slow |
Maximum quality |
| small |
244M |
Medium |
Good balance |
| base |
74M |
Fast |
Quick tests |
| tiny |
39M |
Fastest |
Prototyping only |
Whisper Concepts
| Concept |
Description |
| Language detection |
Auto-detects, or specify for speed |
| Initial prompt |
Improves technical terms accuracy |
| Timestamps |
Segment-level or word-level |
| faster-whisper |
4× faster alternative implementation |
Key concept: Specify language when known—auto-detection adds latency.
Whisper Limitations
- May hallucinate on silence/noise
- No speaker diarization (who said what)
- Accuracy degrades on >30 min audio
- Not suitable for real-time captioning
Stable Diffusion (Image Generation)
Text-to-image generation with various control methods.
SD Use Cases
| Task |
Pipeline |
| Text-to-image |
DiffusionPipeline |
| Style transfer |
Image2Image |
| Fill regions |
Inpainting |
| Guided generation |
ControlNet |
| Custom styles |
LoRA adapters |
SD Models
| Model |
Resolution |
Quality |
| SDXL |
1024×1024 |
Best |
| SD 1.5 |
512×512 |
Good, faster |
| SD 2.1 |
768×768 |
Middle ground |
Key Parameters
| Parameter |
Effect |
Typical Value |
| num_inference_steps |
Quality vs speed |
20-50 |
| guidance_scale |
Prompt adherence |
7-12 |
| negative_prompt |
Avoid artifacts |
"blurry, low quality" |
| strength (img2img) |
How much to change |
0.5-0.8 |
| seed |
Reproducibility |
Fixed number |
Control Methods
| Method |
Input |
Use Case |
| ControlNet |
Edge/depth/pose |
Structural guidance |
| LoRA |
Trained weights |
Custom styles |
| Img2Img |
Source image |
Style transfer |
| Inpainting |
Image + mask |
Fill regions |
Memory Optimization
| Technique |
Effect |
| CPU offload |
Reduces VRAM usage |
| Attention slicing |
Trades speed for memory |
| VAE tiling |
Large image support |
| xFormers |
Faster attention |
| DPM scheduler |
Fewer steps needed |
Key concept: Use SDXL for quality, SD 1.5 for speed. Always use negative prompts.
SD Limitations
- GPU strongly recommended (CPU very slow)
- Large VRAM requirements for SDXL
- May generate anatomical errors
- Prompt engineering matters
Common Patterns
Embedding and Similarity
All three models use embeddings:
- CLIP: Image/text embeddings for similarity
- Whisper: Audio embeddings for transcription
- SD: Text embeddings for image conditioning
GPU Acceleration
| Model |
VRAM Needed |
| CLIP ViT-B/32 |
~2 GB |
| Whisper turbo |
~6 GB |
| SD 1.5 |
~6 GB |
| SDXL |
~10 GB |
Best Practices
| Practice |
Why |
| Use recommended model sizes |
Best quality/speed balance |
| Cache embeddings (CLIP) |
Expensive to recompute |
| Specify language (Whisper) |
Faster than auto-detect |
| Use negative prompts (SD) |
Avoid common artifacts |
| Set seeds for reproducibility |
Consistent results |
Resources