Context Engineering

The discipline of curating the smallest high-signal token set that achieves desired outcomes.

Overview

Context engineering goes beyond prompt engineering. While prompts focus on what you ask, context engineering focuses on everything the model sees—system instructions, tool definitions, documents, message history, and tool outputs.

Key Insight: Context windows are constrained not by raw token capacity but by attention mechanics. As context grows, recall and instruction-following degrade well before the hard token limit is reached.

When to Use

  • Designing agent system prompts
  • Optimizing RAG retrieval pipelines
  • Managing long-running conversations
  • Building multi-agent architectures
  • Reducing token costs while maintaining quality

The "Lost in the Middle" Phenomenon

Models pay unequal attention across the context window:

Attention
Strength   ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░████████
           ↑                                                      ↑
        START              MIDDLE (weakest attention)           END

Practical Implications:

Position   Attention   Best For
START      High        System identity, critical instructions, constraints
MIDDLE     Low         Background context, optional details
END        High        Current task, recent messages, immediate query

The Five Context Layers

1. System Prompts (Identity Layer)

Establishes agent identity at the right "altitude":

TOO HIGH (vague):        "You are a helpful assistant"
TOO LOW (brittle):       "Always respond with exactly 3 bullet points..."
OPTIMAL (principled):    "You are a senior engineer who values clarity,
                          tests assumptions, and explains trade-offs"

Best Practices:

  • Define role and expertise level
  • State core principles (not rigid rules)
  • Include what NOT to do (boundaries)
  • Position at START of context
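
For instance, a sketch of a system prompt at that altitude (the role and domain are illustrative):

SYSTEM_PROMPT = """\
You are a senior backend engineer specializing in distributed systems.

Principles:
- Favor clarity over cleverness; explain trade-offs when recommending a design.
- Test assumptions: ask for missing constraints before committing to an approach.

Boundaries:
- Do NOT invent benchmark numbers or library APIs.
- Do NOT propose architectures you cannot justify.
"""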

2. Tool Definitions (Capability Layer)

Tools steer behavior through descriptions:

# ❌ BAD: Ambiguous - when would you use this?
@tool  # e.g., langchain_core.tools.tool, or your framework's equivalent
def search(query: str) -> str:
    """Search for information."""
    pass

# ✅ GOOD: Clear trigger conditions
@tool
def search_documentation(query: str) -> str:
    """
    Search internal documentation for technical answers.

    USE WHEN:
    - User asks about internal APIs or services
    - Question requires company-specific knowledge
    - Public information is insufficient

    DO NOT USE WHEN:
    - Question is general programming knowledge
    - User explicitly wants external sources
    """
    pass

Rule: If a human cannot definitively say which tool to use, an agent cannot either.

3. Retrieved Documents (Knowledge Layer)

Just-in-time loading beats pre-loading:

# ❌ BAD: Pre-load everything
context = load_all_documentation()  # 50k tokens!

# ✅ GOOD: Progressive disclosure
def build_context(query: str) -> list[str]:
    # Stage 1: Lightweight retrieval (500 tokens)
    summaries = search_summaries(query, top_k=5)

    # Stage 2: Selective deep loading (only if needed)
    if needs_detail(summaries):
        full_docs = load_full_documents(summaries[:2])
        return summaries + full_docs

    return summaries
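
needs_detail is left abstract above; a minimal heuristic sketch, assuming the summaries arrive as plain strings:

def needs_detail(summaries: list[str]) -> bool:
    # Escalate to full documents only when the summaries are thin.
    # The 1,000-character threshold is an illustrative assumption.
    return sum(len(s) for s in summaries) < 1_000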

4. Message History (Memory Layer)

Treat as scratchpad, not permanent storage:

# Implement sliding window with compression
MAX_MESSAGES = 20
COMPRESSION_TRIGGER = 0.7  # 70% of context budget

def manage_history(messages: list, budget: int) -> list:
    messages = messages[-MAX_MESSAGES:]      # Enforce the hard message cap
    current_tokens = count_tokens(messages)  # count_tokens: your tokenizer of choice

    if current_tokens > budget * COMPRESSION_TRIGGER:
        # Compress older messages, keep recent
        old = messages[:-5]
        recent = messages[-5:]

        summary = summarize(old)  # Anchored compression (sketch below)
        return [summary] + recent

    return messages
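
One way summarize could implement anchored compression, assuming chat-style message dicts and a generic llm() completion helper (both assumptions; see the context-compression skill for the full strategy):

def summarize(old_messages: list) -> dict:
    # Anchor the summary to facts that are expensive to re-derive:
    # decisions made, identifiers and file paths, open questions.
    transcript = "\n".join(m["content"] for m in old_messages)
    summary_text = llm(
        "Summarize this conversation. Preserve verbatim: decisions, "
        "file paths, identifiers, and unresolved questions.\n\n" + transcript
    )
    return {"role": "system", "content": f"[Summary of earlier turns]\n{summary_text}"}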

5. Tool Outputs (Observation Layer)

Critical Finding: Tool outputs can reach 83.9% of total context usage!

import json

# ❌ BAD: Return raw output (web_search is a placeholder for your search client)
def search_web(query: str) -> str:
    results = web_search(query)
    return json.dumps(results)  # Could be 10k+ tokens!

# ✅ GOOD: Structured, bounded output
def search_web(query: str) -> str:
    results = web_search(query)

    # Extract only what's needed
    extracted = [
        {
            "title": r["title"],
            "snippet": r["snippet"][:200],  # Truncate
            "url": r["url"]
        }
        for r in results[:5]  # Limit count
    ]

    return json.dumps(extracted)  # ~500 tokens max

The 95% Finding

Research shows what actually drives agent performance. Three factors together explain roughly 95% of the variance, and token usage alone accounts for 80%:

┌────────────────────────────────────────────────────────────────┐
│  TOKEN USAGE        ████████████████████████████████████  80%  │
│  TOOL CALLS         █████  10%                                 │
│  MODEL CHOICE       ██  5%                                     │
│  OTHER              ██  5%                                     │
└────────────────────────────────────────────────────────────────┘

Key Insight: Optimize context efficiency BEFORE switching models.


Context Budget Management

Token Budget Calculator

def calculate_budget(model: str, task_type: str) -> dict:
    """Calculate optimal token allocation."""

    MAX_CONTEXT = {
        "gpt-5.2": 256_000,
        "gpt-5.2-mini": 128_000,
        "claude-opus-4-6": 1_000_000,
        "claude-sonnet-4-5": 1_000_000,
        "gemini-3-pro": 2_000_000,
        "gemini-3-flash": 1_000_000,
        "llama-3": 128_000,
    }

    # Opus 4.6: 128K max output tokens (up from 64K)
    MAX_OUTPUT = {
        "claude-opus-4-6": 128_000,
        "claude-sonnet-4-5": 64_000,
        "gpt-5.2": 64_000,
        "gemini-3-pro": 65_536,
    }

    # Reserve 20% for response generation
    available = MAX_CONTEXT[model] * 0.8

    # Allocation by task type
    ALLOCATIONS = {
        "chat": {
            "system": 0.05,      # 5%
            "tools": 0.05,       # 5%
            "history": 0.60,    # 60%
            "retrieval": 0.20,  # 20%
            "current": 0.10,    # 10%
        },
        "agent": {
            "system": 0.10,     # 10%
            "tools": 0.15,      # 15%
            "history": 0.30,    # 30%
            "retrieval": 0.25,  # 25%
            "observations": 0.20, # 20%
        },
    }

    alloc = ALLOCATIONS[task_type]
    budget = {k: int(v * available) for k, v in alloc.items()}
    budget["max_output"] = MAX_OUTPUT.get(model, 0)  # Surface the output cap too
    return budget
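
For example, an agent budget against the claude-sonnet-4-5 entry in the table above:

calculate_budget("claude-sonnet-4-5", "agent")
# -> {"system": 80000, "tools": 120000, "history": 240000,
#     "retrieval": 200000, "observations": 160000, "max_output": 64000}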

CC 2.1.32 Skill Budget Scaling

CC 2.1.32+ automatically scales the skill character budget to 2% of the context window:

Context Window     Skill Budget (2%)   OrchestKit Skills
200K tokens        ~4,000 tokens       Standard (1,200-token skill-injection budget)
500K tokens        ~10,000 tokens      3x more room for skill descriptions
1M tokens (beta)   ~20,000 tokens      5x skill budget, richer auto-suggest

OrchestKit's token budgets auto-scale proportionally via the CLAUDE_MAX_CONTEXT environment variable.
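
A minimal sketch of that scaling, assuming CLAUDE_MAX_CONTEXT holds the window size in tokens and that 200K is the default:

import os

def skill_budget_tokens() -> int:
    # 2% of the context window, per the table above.
    max_context = int(os.environ.get("CLAUDE_MAX_CONTEXT", "200000"))
    return int(max_context * 0.02)  # 200K -> ~4,000; 1M -> ~20,000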

Compression Triggers

COMPRESSION_CONFIG = {
    "trigger_threshold": 0.70,    # Start compressing at 70%
    "target_threshold": 0.50,     # Compress down to 50%
    "preserve_recent": 5,         # Always keep last 5 messages
    "preserve_system": True,      # Never compress system prompt
}
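
A sketch wiring these triggers together, reusing the count_tokens and summarize helpers from the history example above:

def enforce_budget(messages: list, budget: int, cfg=COMPRESSION_CONFIG) -> list:
    # preserve_system is handled upstream: the system prompt never enters `messages`.
    if count_tokens(messages) <= budget * cfg["trigger_threshold"]:
        return messages

    recent = messages[-cfg["preserve_recent"]:]
    summary = summarize(messages[:-cfg["preserve_recent"]])

    # Compressing well below the trigger (toward target_threshold)
    # leaves headroom before compression fires again.
    return [summary] + recent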

Attention-Aware Positioning

Template Structure

[START - HIGH ATTENTION]
## System Identity
You are a {role} specialized in {domain}.

## Critical Constraints
- NEVER {dangerous_action}
- ALWAYS {required_behavior}

[MIDDLE - LOWER ATTENTION]
## Background Context
{retrieved_documents}
{older_conversation_history}

[END - HIGH ATTENTION]
## Current Task
{recent_messages}
{user_query}

## Response Guidelines
{output_format_instructions}

Priority Positioning Rules

  1. Identity & Constraints → START (immutable)
  2. Critical instructions → START or END
  3. Retrieved documents → MIDDLE (expandable)
  4. Conversation history → MIDDLE (compressible)
  5. Current query → END (always visible)
  6. Output format → END (guides generation)
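
A sketch of an assembler that encodes these rules; each argument is assumed to be a pre-rendered string:

def assemble_context(identity: str, constraints: str, documents: str,
                     history: str, query: str, output_format: str) -> str:
    # Order mirrors the rules above: immutable material at START,
    # expandable/compressible material in the MIDDLE, the task at END.
    return "\n\n".join([
        identity,       # 1. Identity & constraints -> START
        constraints,    # 2. Critical instructions  -> START
        documents,      # 3. Retrieved documents    -> MIDDLE
        history,        # 4. Conversation history   -> MIDDLE
        query,          # 5. Current query          -> END
        output_format,  # 6. Output format          -> END
    ])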

Metrics: Tokens-Per-Task

Optimize for total task completion, not individual requests:

from dataclasses import dataclass

@dataclass
class TaskMetrics:
    task_id: str
    total_tokens: int = 0
    request_count: int = 0
    retrieval_tokens: int = 0
    generation_tokens: int = 0

    @property
    def tokens_per_request(self) -> float:
        return self.total_tokens / max(self.request_count, 1)

    @property
    def efficiency_ratio(self) -> float:
        """Lower is better - generation vs total context."""
        return self.generation_tokens / max(self.total_tokens, 1)
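
Illustrative usage over a three-request task (all numbers hypothetical):

m = TaskMetrics(task_id="fix-login-bug")
for tokens in (8_000, 12_000, 6_000):  # total context sent per request
    m.total_tokens += tokens
    m.request_count += 1
m.generation_tokens = 2_600

print(f"{m.tokens_per_request:.0f}")  # ~8667 tokens per request
print(f"{m.efficiency_ratio:.2f}")    # 0.10 -> 90% of tokens were context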

Anti-pattern: Aggressive compression that loses critical details forces expensive re-fetching, consuming MORE tokens overall.


Common Pitfalls

Pitfall           Problem                        Solution
Token stuffing    "More context = better"        Quality over quantity
Flat structure    No priority signaling          Use headers, positioning
Static context    Same context for all queries   Dynamic, query-relevant retrieval
Ignoring middle   Important info gets lost       Position critical info at START/END
No compression    Context grows unbounded        Sliding window + summarization

Integration with OrchestKit

Agent System Prompts

Apply attention-aware positioning to agent definitions:

# Agent: backend-system-architect

[HIGH ATTENTION - START]
## Identity
Senior backend architect with 15+ years experience.

## Constraints
- NEVER suggest unvalidated security patterns
- ALWAYS consider multi-tenant isolation

[LOWER ATTENTION - MIDDLE]
## Domain Knowledge
{dynamically_loaded_patterns}

[HIGH ATTENTION - END]
## Current Task
{user_request}

Skill Loading

Progressive skill disclosure:

# Stage 1: Load skill metadata only (~100 tokens)
skill_index = load_skill_summaries()

# Stage 2: Load relevant skill on demand (~500 tokens)
if task_matches("database"):
    full_skill = load_skill("pgvector-search")


CC 2.1.7: MCP Auto-Discovery and Deferral

MCP Search Mode

CC 2.1.7 introduces intelligent MCP tool discovery. When context usage exceeds 10% of the effective window, MCPs are automatically deferred to reduce token overhead.

Context < 10%:  MCP tools immediately available
Context > 10%:  MCP tools discovered via MCPSearch (deferred loading)

Savings: ~7,200 tokens per session on average

How Auto-Deferral Works

The context budget monitor tracks usage against the effective window:

  1. Below 10%: MCP tool definitions loaded in context (~1,200 tokens)
  2. Above 10%: MCP tools deferred, available via MCPSearch on-demand
  3. State file: /tmp/claude-mcp-defer-state-{session}.json
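
A minimal sketch of the threshold check the monitor performs (function and parameter names are assumptions):

def mcp_should_defer(used_tokens: int, effective_window: int) -> bool:
    # Defer MCP tool definitions once usage crosses 10% of the effective window.
    return used_tokens / effective_window > 0.10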

Best Practices for MCP with Auto-Deferral

  1. Use MCPs early - Before context fills up
  2. Batch MCP calls - Multiple queries in one turn
  3. Cache MCP results - Store retrieved docs in context
  4. Monitor statusline - Watch for mcp.deferred: true

Checking MCP Deferral State

cat /tmp/claude-mcp-defer-state-${CLAUDE_SESSION_ID}.json

Related Skills

  • context-compression - Compression strategies and anchored summarization
  • multi-agent-orchestration - Context isolation across agents
  • rag-retrieval - Optimizing retrieved document context
  • prompt-caching - Reducing redundant context transmission

Version: 1.0.0 (January)
Based on: Context Engineering research, BrowseComp evaluation findings
Key Metric: 80% of agent performance variance explained by token usage

Capability Details

attention-mechanics

Keywords: context window, attention, lost in the middle, token budget
Solves:

  • Understand lost-in-the-middle effect (high attention at START/END)
  • Position critical info strategically
  • Optimize tokens-per-task not tokens-per-request

context-layers

Keywords: context anatomy, context structure, five layers
Solves:

  • Understand 5 context layers (system, tools, docs, history, outputs)
  • Implement just-in-time document loading
  • Manage tool output truncation

budget-allocation

Keywords: token budget, context budget, allocation
Solves:

  • Allocate tokens across context layers
  • Implement compression triggers at 70% utilization
  • Target 50% utilization after compression