vision-multimodal
Vision & Multimodal Skill
Leverage Claude's vision capabilities for image analysis, document processing, and multimodal understanding.
When to Use This Skill
- Image analysis and description
- Document/PDF processing
- Screenshot analysis
- OCR-like text extraction
- Visual comparison
- Chart and diagram interpretation
Supported Formats
| Format | Status | Best For |
|---|---|---|
| JPEG | ✓ | Photos, natural scenes |
| PNG | ✓ | Screenshots, UI, text |
| GIF | ✓ | Animated (first frame) |
| WebP | ✓ | Modern, compressed |
| ✓ | Documents (via Files API) |
Image Size Guidelines
- Minimum: 200 pixels (smaller = reduced accuracy)
- Optimal: 1000x1000 pixels
- Maximum: 8000x8000 pixels
- Token cost: ~(width × height) / 1000
- Tip: Resize to 1568px max dimension for 30-50% token savings
Core Patterns
Pattern 1: Single Image Analysis
import anthropic
import base64
client = anthropic.Anthropic()
# Load and encode image
with open("image.jpg", "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": image_data
}
},
{
"type": "text",
"text": "Describe this image in detail."
}
]
}]
)
Pattern 2: Image from URL
import httpx
# Fetch and encode from URL
image_url = "https://example.com/image.jpg"
response = httpx.get(image_url)
image_data = base64.standard_b64encode(response.content).decode("utf-8")
# Then use same pattern as above
Pattern 3: Multiple Images
# Compare multiple images (up to 100 per request)
messages = [{
"role": "user",
"content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}},
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}},
{"type": "text", "text": "Compare these two images and list the differences."}
]
}]
Pattern 4: Few-Shot with Images
# Teach by example
messages = [
# Example 1
{"role": "user", "content": [
{"type": "image", "source": {...}},
{"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},
# Example 2
{"role": "user", "content": [
{"type": "image", "source": {...}},
{"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},
# Target image
{"role": "user", "content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
{"type": "text", "text": "Classify this image."}
]}
]
Pattern 5: PDF Processing
# Using Files API (beta)
with open("document.pdf", "rb") as f:
pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
messages=[{
"role": "user",
"content": [
{
"type": "document",
"source": {
"type": "base64",
"media_type": "application/pdf",
"data": pdf_data
}
},
{"type": "text", "text": "Summarize this document."}
]
}]
)
Prompt Engineering for Vision
Strategy 1: Role Assignment
prompt = """You have perfect vision and exceptional attention to detail,
making you an expert at analyzing technical diagrams.
Analyze this architecture diagram and identify:
1. All components
2. Data flow between components
3. Potential bottlenecks"""
Strategy 2: Step-by-Step Thinking
prompt = """Before answering, analyze the image systematically:
<thinking>
1. What is the overall subject?
2. What are the key elements?
3. How do elements relate to each other?
4. What details stand out?
</thinking>
Then provide your answer based on this analysis."""
Strategy 3: Structured Output
prompt = """Extract information from this receipt and return as JSON:
{
"vendor": "",
"date": "",
"items": [{"name": "", "price": 0}],
"total": 0
}"""
Image Optimization
from PIL import Image
import io
def optimize_for_claude(image_path, max_dimension=1568):
"""Resize image to reduce token usage by 30-50%"""
with Image.open(image_path) as img:
# Calculate new dimensions
ratio = min(max_dimension / img.width, max_dimension / img.height)
if ratio < 1:
new_size = (int(img.width * ratio), int(img.height * ratio))
img = img.resize(new_size, Image.LANCZOS)
# Convert to bytes
buffer = io.BytesIO()
img.save(buffer, format="JPEG", quality=85)
return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
Common Use Cases
Text Extraction (OCR-like)
prompt = """Extract all text from this image.
Preserve the original formatting and structure as much as possible.
If text is unclear, indicate with [unclear]."""
Table Extraction
prompt = """Extract the table data from this image.
Return as a markdown table with proper headers and alignment."""
Chart Analysis
prompt = """Analyze this chart:
1. What type of chart is this?
2. What are the axes/labels?
3. What are the key data points?
4. What trends or patterns are visible?"""
Best Practices
DO:
- Use high-quality images (≥1000px)
- Resize large images to save tokens
- Provide context about what to look for
- Use few-shot examples for consistent output
DON'T:
- Send images smaller than 200px
- Expect perfect OCR for handwriting
- Send very large images (>8000px)
- Ignore token costs for multiple images
Limitations
- Cannot identify specific individuals
- May struggle with very small text
- Animated GIFs: only first frame analyzed
- Some specialized symbols may be misread
See Also
- [[llm-integration]] - API basics
- [[extended-thinking]] - Complex reasoning
- [[citations-retrieval]] - Document citations
More from lobbi-docs/claude
design-system
Apply and manage the AI-powered design system with 50+ curated styles
126complex-reasoning
Multi-step reasoning patterns and frameworks for systematic problem solving. Activate for Chain-of-Thought, Tree-of-Thought, hypothesis-driven debugging, and structured analytical approaches that leverage extended thinking.
105kanban
Kanban methodology including boards, WIP limits, flow metrics, and continuous delivery. Activate for Kanban boards, workflow visualization, and lean project management.
63deep-analysis
Analytical thinking patterns for comprehensive evaluation, code audits, security analysis, and performance reviews. Provides structured templates for thorough investigation with extended thinking support.
47batch-processing
Message Batches API for Claude with 50% cost savings on bulk processing. Activate for batch jobs, JSONL processing, bulk analysis, and cost optimization.
46aws
AWS cloud services including EC2, EKS, S3, Lambda, RDS, and IAM. Activate for AWS infrastructure, cloud deployment, and Amazon Web Services integration.
38