Prompt Caching Skill
Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.
When to Use This Skill
- RAG systems with large static documents
- Multi-turn conversations with long instructions
- Code analysis with large codebase context
- Batch processing with shared prefixes
- Document analysis and summarization
Core Concepts
Cache Control Placement
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to a large knowledge base...",
            "cache_control": {"type": "ephemeral"}  # Cache this content
        }
    ],
    messages=[{"role": "user", "content": "What is...?"}]
)
```
Cache Hierarchy
Cache breakpoints are checked in this order, with each level building on the one before it; the sketch after this list shows breakpoints at all three levels:
1. Tools - tool definitions are cached first
2. System - the system prompt is cached next
3. Messages - conversation history is cached last
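A minimal sketch of all three levels in one request; `search_docs`, `long_instructions`, and the user text are illustrative placeholders, not part of any real schema:

```python
# Hypothetical request with breakpoints at the tools, system, and messages levels.
# The cached prefix is built in that order: tools -> system -> messages.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[
        {
            "name": "search_docs",  # placeholder tool
            "description": "Search the internal knowledge base.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
            "cache_control": {"type": "ephemeral"},  # Breakpoint 1: tools
        }
    ],
    system=[
        {
            "type": "text",
            "text": long_instructions,  # large, stable instructions
            "cache_control": {"type": "ephemeral"},  # Breakpoint 2: system
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Summarize the onboarding docs.",
                    "cache_control": {"type": "ephemeral"},  # Breakpoint 3: messages
                }
            ],
        }
    ],
)
```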
TTL Options
| TTL | Write Cost | Read Cost | Use Case |
|---|---|---|---|
| 5 minutes (default) | 1.25x base | 0.1x base | Interactive sessions |
| 1 hour | 2.0x base | 0.1x base | Batch processing, stable docs |
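The TTL is set on the cache_control block itself; a short sketch where `stable_context` is a placeholder string and 5 minutes is the default when `ttl` is omitted:

```python
# Default 5-minute entry vs. explicit 1-hour entry for the same content.
system_5m = [{"type": "text", "text": stable_context,
              "cache_control": {"type": "ephemeral"}}]
system_1h = [{"type": "text", "text": stable_context,
              "cache_control": {"type": "ephemeral", "ttl": "1h"}}]
```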
Cache Requirements
- Minimum cacheable length: 1024-4096 tokens, depending on the model (a token-count check is sketched below)
- Maximum breakpoints: 4 per request
- Supported models: recent Claude models across the Opus, Sonnet, and Haiku families (check the current docs for exact minimums and model coverage)
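Content below the minimum is simply processed without caching (the request still succeeds), so it can help to check size up front. A sketch using the SDK's token-counting endpoint, reusing the `client` from above with a placeholder document variable:

```python
# Count tokens to decide whether a cache breakpoint is worth adding.
count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    system=[{"type": "text", "text": large_document_content}],
    messages=[{"role": "user", "content": "placeholder question"}],
)
if count.input_tokens >= 1024:  # minimum varies by model
    print("Large enough to benefit from a cache breakpoint")
```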
Implementation Patterns
Pattern 1: Single Breakpoint (Recommended)
```python
# Best for: document analysis, Q&A with static context
system = [
    {
        "type": "text",
        "text": large_document_content,
        "cache_control": {"type": "ephemeral"}  # Single breakpoint at the end
    }
]
```
Pattern 2: Multi-Turn Conversation
```python
# The cache grows with the conversation: place the breakpoint on the last
# content block (cache_control goes on a content block, not on the message itself)
# so each turn reuses the previously cached prefix.
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Follow-up question",
                "cache_control": {"type": "ephemeral"}  # Cache the conversation so far
            }
        ]
    }
]
```
Pattern 3: RAG with Multiple Breakpoints
```python
# Two breakpoints: stable instructions first, retrieved documents second.
system = [
    {
        "type": "text",
        "text": "Stable instructions and tool-usage guidance...",
        "cache_control": {"type": "ephemeral"}  # Breakpoint 1: stable instructions
    },
    {
        "type": "text",
        "text": retrieved_documents,  # changes only when retrieval changes
        "cache_control": {"type": "ephemeral"}  # Breakpoint 2: documents
    }
]
```
Pattern 4: Batch Processing with 1-Hour TTL
```python
# Warm the cache before the batch
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": shared_context,
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }],
    messages=[{"role": "user", "content": "Initialize cache"}]
)

# Now run the batch - subsequent requests hit the cache
for item in batch_items:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }],
        messages=[{"role": "user", "content": item}]
    )
```
Performance Monitoring
Check Cache Usage
```python
response = client.messages.create(...)

# Monitor these usage fields
cache_write = response.usage.cache_creation_input_tokens  # new cache entry written
cache_read = response.usage.cache_read_input_tokens       # cache hit
uncached = response.usage.input_tokens                    # tokens after the last breakpoint

total = cache_read + cache_write + uncached
print(f"Cache hit rate: {cache_read / total * 100:.1f}%")
```
Cost Calculation
```python
def calculate_cost(usage, model="claude-sonnet-4-20250514"):
    # Example rates for illustration (check current pricing)
    base_input_rate = 0.003  # dollars per 1K input tokens

    write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25
    read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
    uncached_cost = (usage.input_tokens / 1000) * base_input_rate

    return write_cost + read_cost + uncached_cost
```
Cache Invalidation
Changes that invalidate cache:
| Change | Impact |
|---|---|
| Tool definitions | Entire cache invalidated |
| System prompt | System + messages invalidated |
| Any content before breakpoint | That breakpoint + later invalidated |
Best Practices
DO:
- Place the breakpoint at the END of the static content (see the sketch after these lists)
- Keep tools and instructions stable across requests
- Use the 1-hour TTL for batch processing
- Monitor cache_read_input_tokens to confirm savings
DON'T:
- Place a breakpoint in the middle of dynamic content
- Change tool definitions frequently
- Expect caching below the minimum token count (roughly 1024 tokens, model-dependent)
- Ignore the 20-block lookback limit when placing breakpoints
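A sketch of the first two points above, with placeholder names: the stable content ends at the breakpoint and per-request content comes after it, so the cached prefix stays byte-identical across requests:

```python
# Static prefix (identical on every request) ends at the breakpoint...
system = [
    {
        "type": "text",
        "text": standing_instructions + reference_docs,  # stable content only
        "cache_control": {"type": "ephemeral"}           # breakpoint at its END
    }
]

# ...and dynamic, per-request content goes after the breakpoint.
messages = [{"role": "user", "content": todays_question}]
```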
Integration with Extended Thinking
```python
# Prompt caching combined with extended thinking
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[{
        "type": "text",
        "text": large_context,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Analyze this..."}]
)
```
See Also
- [[llm-integration]] - Claude API basics
- [[extended-thinking]] - Deep reasoning
- [[batch-processing]] - Bulk processing