prompt-caching
Prompt Caching for Agentic AI Applications
Prompt caching is the single most impactful cost optimization for agentic AI systems. In Claude Code, it reduced costs by ~90% by ensuring the vast majority of input tokens are cache reads rather than cache writes or uncached input.
Core Principle: Prefix Matching
Prompt caching works on exact prefix matching. The API caches the longest prefix of your request that matches a previous request. Any change at position N invalidates the cache for everything after position N.
Request layout (order matters):
[System Prompt] → [Tool Definitions] → [System Reminders] → [Conversation Messages]
The system prompt and tool definitions form the stable prefix. Conversation messages are the dynamic suffix that grows with each turn. This layout maximizes cache hits because the stable prefix is identical across every request in a session.
Key insight: You are NOT paying for the full context on every turn. With caching, you pay full price once (cache write), then ~10% for every subsequent read. The longer your stable prefix, the more you save.
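The arithmetic behind that claim can be sketched directly. This is an illustrative cost model, not authoritative pricing: the 1.25x cache-write multiplier, 0.10x cache-read multiplier, and $3/MTok base price are assumptions.

```python
def session_input_cost(prefix_tokens: int, turns: int,
                       base_price_per_mtok: float = 3.00,
                       write_multiplier: float = 1.25,
                       read_multiplier: float = 0.10) -> float:
    """Approximate input cost when a stable prefix is re-sent every turn:
    one cache write on turn 1, then a cache read on every later turn.
    All prices are illustrative assumptions."""
    per_token = base_price_per_mtok / 1_000_000
    write_cost = prefix_tokens * per_token * write_multiplier   # turn 1
    read_cost = prefix_tokens * per_token * read_multiplier * (turns - 1)
    return write_cost + read_cost
```

For a 50k-token prefix over 20 turns this comes to roughly 16% of what the same prefix would cost fully uncached, and the ratio keeps improving as the session grows.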
1. System Prompt Ordering: Static First, Dynamic Last
Structure your system prompt with the most stable content at the top:
1. Core identity and capabilities (never changes)
2. Tool usage instructions (rarely changes)
3. Code style guidelines (rarely changes)
4. Project-specific context from CLAUDE.md (changes per project)
5. Environment info, git status (changes per session)
6. Memory files, auto-loaded skills (changes per session)
Anti-pattern: Putting timestamps, random session IDs, or frequently changing data at the top of the system prompt. This invalidates the entire cache on every request.
Pattern: If you must include dynamic data (current date, git status), append it at the very end of the system prompt or use system reminder messages between conversation turns.
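A minimal sketch of this ordering, with hypothetical section names: stable sections are concatenated first, per-project and per-session data last, so dynamic content only ever invalidates the tail of the prefix.

```python
def build_system_prompt(core_identity: str, tool_instructions: str,
                        style_guide: str, project_context: str = "",
                        session_state: str = "") -> str:
    """Concatenate system-prompt sections from most to least stable."""
    sections = [core_identity, tool_instructions, style_guide,  # stable
                project_context,                                # per project
                session_state]                                  # per session
    return "\n\n".join(s for s in sections if s)
```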
2. System Messages for Mid-Conversation Updates
When you need to inject dynamic context mid-conversation (new file contents, updated state, skill content), use system-type messages inserted between conversation turns rather than modifying the system prompt.
System Prompt (cached) → Tools (cached) → [turn 1] → <system-reminder> → [turn 2] → ...
This preserves the stable prefix cache while still delivering fresh context. The system reminder only adds new tokens at the end of the message sequence.
Use cases:
- Injecting skill content when a skill is triggered
- Updating git status or environment state
- Adding hook feedback or validation results
- Loading file contents referenced by the user
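A sketch of the injection step, assuming the convention of wrapping reminders in a `<system-reminder>` tag inside a user-role message: the new content is appended after the cached history, and nothing earlier in the request changes.

```python
def inject_reminder(messages: list[dict], reminder: str) -> list[dict]:
    """Deliver fresh context as a new message after the cached history,
    leaving the system prompt and earlier turns byte-identical."""
    reminder_msg = {
        "role": "user",
        "content": f"<system-reminder>\n{reminder}\n</system-reminder>",
    }
    return messages + [reminder_msg]
```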
3. Model Switching: The Subagent Pattern
Switching models mid-session (e.g., from Opus to Haiku for a simple task) destroys your cache because different models have separate cache pools.
Anti-pattern: Switching the primary model mid-conversation.
Pattern: Use subagents — spawn a separate, short-lived agent on the cheaper model. The subagent has its own conversation context (and its own cache), while the parent agent's cache remains intact.
```
Main agent (Opus, cached context preserved)
└── Subagent (Haiku, fresh context, cheap)
      └── Returns result to main agent
```
Benefits:
- Main agent cache stays warm
- Subagent context is minimal (only the specific task)
- Subagent results are summarized back, keeping main context lean
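The pattern reduces to a few lines. `call_model` here is a stand-in for whatever API client you use, and the model name is hypothetical; the point is that the subagent gets a fresh message list while the parent's context is never mutated.

```python
def run_subagent(task: str, call_model) -> str:
    """Spawn a short-lived agent with a fresh, minimal context on a
    cheaper model; the parent's messages (and cache) are never touched."""
    sub_messages = [{"role": "user", "content": task}]  # fresh context
    # 'call_model' is a placeholder for your actual API client.
    return call_model(model="cheap-model", messages=sub_messages)
```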
4. Tool Stability: Never Mutate the Tool Set
Tool definitions are part of the cached prefix. Adding, removing, or reordering tools mid-session invalidates the cache for the entire tool block and everything after it.
Anti-pattern: Dynamically adding/removing tools based on conversation state.
Pattern: Define ALL tools upfront. Use state transition tools to control which actions are valid:
```
# Instead of removing the "execute" tool when not in execute mode:
# Define a "request_execution_permission" tool that transitions state

Tools (always present):
- read_file
- write_file
- request_execution_permission   ← gates access
- execute_command                ← always defined, validated at runtime
```
Runtime validation happens in your application logic, not in the tool definitions sent to the API.
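A sketch of that runtime gate, using the tool names from the example above. All four tools stay in every API request; only this check decides whether a call is honored.

```python
def handle_tool_call(name: str, state: dict) -> str:
    """Validate tool calls against session state instead of mutating
    the tool set sent to the API."""
    if name == "request_execution_permission":
        state["execution_allowed"] = True  # state transition, not a tool-set change
        return "Execution permission granted."
    if name == "execute_command" and not state.get("execution_allowed"):
        return ("Error: execution is not permitted yet; "
                "call request_execution_permission first.")
    return f"{name}: ok"
```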
5. Plan Mode Pattern: Tools as State Machines
The EnterPlanMode / ExitPlanMode pattern demonstrates cache-safe state transitions:
Tools defined (constant):
- EnterPlanMode (available when: not in plan mode)
- ExitPlanMode (available when: in plan mode)
- Read, Glob, Grep (available in both modes)
- Edit, Write (available when: not in plan mode)
All tools are always present in the API request. The application layer enforces which tools are valid based on current state. When the model calls a tool that's not valid in the current state, the application returns an error message — it does NOT remove the tool from the next request.
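The state machine above can be sketched as a dispatch table. The availability sets mirror the list in this section; the tool list sent to the API stays constant, and only the application-layer table governs which calls are accepted in each mode.

```python
AVAILABLE = {
    "plan":    {"ExitPlanMode", "Read", "Glob", "Grep"},
    "default": {"EnterPlanMode", "Read", "Glob", "Grep", "Edit", "Write"},
}
TRANSITIONS = {"EnterPlanMode": "plan", "ExitPlanMode": "default"}

def dispatch(tool: str, mode: str) -> tuple[str, str]:
    """Return (new_mode, result); invalid calls get an error message
    back, and the tool is never removed from the next request."""
    if tool not in AVAILABLE[mode]:
        return mode, f"Error: {tool} is not available in {mode} mode."
    return TRANSITIONS.get(tool, mode), f"{tool}: ok"
```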
6. Tool Search / Deferred Loading
For large tool sets (e.g., many MCP servers), loading all tools upfront can bloat the prefix. Use a lightweight stub + deferred loading pattern:
```
Initial tool set:
- core_tools (read, write, search, etc.)
- tool_search(query: string)   ← meta-tool

When model calls tool_search("database"):
→ System finds matching MCP tools
→ Returns tool descriptions as TEXT in the tool result
→ Model uses the discovered tool on the next turn
```
The key insight: tool descriptions returned as text in a tool result don't affect the cached prefix. Only the formal tool definitions in the API request affect caching.
Tools marked defer_loading: true are not included in the initial request but can be loaded on demand without invalidating the prefix formed by the other tools, because they are appended at the end of the tool list.
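A sketch of the tool_search handler. The registry contents are invented for illustration; the essential property is that matches come back as plain text in the tool result, leaving the cached tool-definition prefix untouched.

```python
# Hypothetical registry of deferred tool descriptions.
DEFERRED_TOOLS = {
    "db_query": "Run a SQL query against the project database.",
    "db_schema": "List tables and columns in the project database.",
    "send_email": "Send an email via the mail MCP server.",
}

def tool_search(query: str) -> str:
    """Meta-tool handler: matching tool docs are returned as text in
    the tool result, so the cached tool definitions are unchanged."""
    q = query.lower()
    hits = [f"{name}: {doc}" for name, doc in DEFERRED_TOOLS.items()
            if q in name.lower() or q in doc.lower()]
    return "\n".join(hits) if hits else f"No tools matching '{query}'."
```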
7. Cache-Safe Compaction
When conversation context approaches the context window limit, you must compact (summarize) older messages. Naive compaction destroys the cache. Cache-safe compaction preserves it.
Cache-safe compaction flow:
```
Before compaction:
[System Prompt] [Tools] [msg1] [msg2] ... [msg50]
 ↑ cached prefix

After compaction (fork the conversation):
[System Prompt] [Tools] [summary of msg1..msg45] [msg46] ... [msg50]
 ↑ same cached prefix preserved!
```
Critical rules:
- Never modify the system prompt or tools during compaction — this preserves the prefix cache
- Keep recent messages verbatim — the model needs exact recent context for coherent continuation
- Summarize older messages — replace early messages with a concise summary
- Use a compaction buffer — trigger compaction before hitting the limit, not at the limit. Leave room for the summary + a few more turns
Compaction buffer sizing:
```
context_window     = 200k tokens
compaction_trigger = 160k tokens (80%)
compaction_target  = 100k tokens (50%)
preserved_recent   = last 20-30 messages
```
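The compaction rules above can be sketched as a single function. It touches only the conversation messages, so the system prompt and tools (the cached prefix) are preserved by construction; `summarize` stands in for whatever summarization call you use.

```python
def compact(messages: list[dict], summarize,
            keep_recent: int = 25) -> list[dict]:
    """Fork the conversation: older messages collapse into one summary
    message, recent messages stay verbatim for coherent continuation."""
    if len(messages) <= keep_recent:
        return messages  # nothing to compact yet
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_msg = {"role": "user",
                   "content": f"<summary>\n{summarize(old)}\n</summary>"}
    return [summary_msg] + recent
```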
8. Cache-Safe Forking
When you need to explore multiple approaches (e.g., trying different fixes), fork the conversation:
```
Main context: [System] [Tools] [msg1..msg20]
Fork A:       [System] [Tools] [msg1..msg20] [try approach A]
Fork B:       [System] [Tools] [msg1..msg20] [try approach B]
```
Both forks share the same prefix cache. This is dramatically cheaper than starting fresh conversations for each approach.
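A quick way to sanity-check this property: measure how many leading messages two forks have in common, since only that shared prefix can be served from the cache.

```python
def shared_prefix_len(fork_a: list[dict], fork_b: list[dict]) -> int:
    """Count leading messages two forks share; forks should diverge
    as late as possible to maximize the cacheable prefix."""
    n = 0
    for a, b in zip(fork_a, fork_b):
        if a != b:
            break
        n += 1
    return n
```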
9. Monitoring and Debugging Cache Performance
Track these metrics for every API request:
| Metric | Target | Alert Threshold |
|---|---|---|
| Cache read rate | >90% of input tokens | <80% |
| Cache write rate | <10% of input tokens | >20% |
| Uncached tokens | <5% of input tokens | >10% |
Common cache miss causes:
- Dynamic content at the start of system prompt (timestamps, random IDs)
- Tool definitions changed between requests
- Model switched mid-session
- System prompt modified mid-session
- Tool order shuffled between requests
Debugging steps:
- Compare the system prompt between two consecutive requests — any diff?
- Compare tool definitions — any added/removed/reordered?
- Check if model changed between requests
- Look for any prefix modification
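The metrics in the table can be computed per request from the usage object the Messages API returns, where `input_tokens` counts only the uncached portion and cached reads/writes are reported separately.

```python
def cache_metrics(usage: dict) -> dict:
    """Compute cache read/write/uncached rates from Anthropic-style
    usage fields on an API response."""
    read = usage.get("cache_read_input_tokens", 0)
    write = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + write + uncached
    if total == 0:
        return {"read_rate": 0.0, "write_rate": 0.0, "uncached_rate": 0.0}
    return {"read_rate": read / total,
            "write_rate": write / total,
            "uncached_rate": uncached / total}
```

Alert when read_rate drops below your threshold (e.g., 80%); a sudden drop usually points to one of the cache miss causes listed above.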
10. Auto-Caching vs Manual Cache Control
The Anthropic API supports automatic caching for prompts ≥1024 tokens (2048 for Claude 3.5 Haiku). No explicit cache breakpoints needed — the system automatically caches the longest matching prefix.
Manual cache control uses the cache_control parameter for explicit breakpoints:
```json
{
  "system": [
    {
      "type": "text",
      "text": "Your system prompt...",
      "cache_control": {"type": "ephemeral"}
    }
  ]
}
```
When to use manual vs auto:
| Scenario | Recommendation |
|---|---|
| Agentic loops (Claude Code, Manus) | Auto-caching sufficient — prefix grows naturally |
| Short prompts (<1024 tokens) | Manual cache_control to force caching |
| Multi-turn with stable tools | Auto-caching handles this well |
| Critical breakpoints you must guarantee | Manual cache_control for precision |
Cost impact: Cached token reads cost ~10% of base input price. In agentic loops where the same prefix is sent dozens of times, this reduces costs by up to 90%. Cache hit rate is the #1 cost metric to track.
For detailed auto-caching implementation, pricing breakdown, and agentic loop optimization, see references/auto-caching-api.md.
Quick Reference: Is It Cache-Safe?
| Action | Cache-Safe? | Alternative |
|---|---|---|
| Add dynamic data to system prompt top | No | Append at end or use system reminders |
| Remove a tool mid-session | No | Keep tool, validate at runtime |
| Add a new tool mid-session | Partial | Append at end of tool list |
| Switch model mid-conversation | No | Use subagent on different model |
| Reorder tools between requests | No | Keep consistent tool ordering |
| Modify system prompt | No | Use system reminder messages |
| Summarize old messages | Yes | Fork with same prefix |
| Add new user/assistant turns | Yes | Normal conversation flow |
| Inject system reminder between turns | Yes | Appended after cached prefix |
For detailed implementation patterns, pseudocode examples, and anti-pattern catalog, see references/caching-patterns.md.