# Cost-Aware LLM Pipeline

## Model Routing Strategy
Route tasks to the cheapest model that can handle them reliably.
### Pricing Reference (per 1M tokens, USD)
| Model | Input | Output | Relative Cost |
|---|---|---|---|
| Haiku | $0.80 | $4.00 | 1x (baseline) |
| Sonnet | $3.00 | $15.00 | ~4x |
| Opus | $15.00 | $75.00 | ~19x |
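The relative-cost column follows directly from the per-token prices. A minimal sketch of per-call cost estimation, with prices hard-coded from the table above:

```python
# Per-1M-token prices (USD), taken from the pricing table above.
PRICING = {
    "haiku":  {"input": 0.80,  "output": 4.00},
    "sonnet": {"input": 3.00,  "output": 15.00},
    "opus":   {"input": 15.00, "output": 75.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single call."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: the same 50K-in / 10K-out call on each model.
for model in PRICING:
    print(f"{model}: ${estimate_cost(model, 50_000, 10_000):.2f}")
```

The same call costs roughly 19x more on Opus than on Haiku, which is why routing matters more than prompt-level micro-optimization.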
### Routing Rules
| Task Complexity | Route To | Examples |
|---|---|---|
| Simple (< 100 lines, clear pattern) | Haiku | File renaming, simple search, format conversion, status checks |
| Moderate (100-500 lines, some judgment) | Sonnet | Code review, test writing, refactoring, documentation |
| Complex (500+ lines, deep reasoning) | Opus | Architecture design, debugging subtle issues, multi-file refactoring |
| Creative (open-ended, high quality bar) | Opus | System design, novel algorithms, critical security review |
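The complexity tiers in the table can be approximated mechanically. This heuristic classifier is an assumption for illustration, using only the line-count thresholds from the table; real routing should also weigh task type:

```python
def estimate_complexity(line_count: int, open_ended: bool = False) -> str:
    """Rough complexity tier based on the thresholds in the routing table."""
    if open_ended:
        return "creative"  # open-ended, high quality bar
    if line_count < 100:
        return "simple"
    if line_count <= 500:
        return "moderate"
    return "complex"

print(estimate_complexity(50))                    # simple
print(estimate_complexity(300))                   # moderate
print(estimate_complexity(800))                   # complex
print(estimate_complexity(10, open_ended=True))   # creative
```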
### Implementation with the Claude Code Agent Tool

```python
# In agent orchestration, pick the model based on task complexity.
def route_to_model(task_description: str, estimated_complexity: str) -> str:
    """Return the model parameter for the Agent tool."""
    routing = {
        "simple": "haiku",
        "moderate": "sonnet",
        "complex": "opus",
        "creative": "opus",
    }
    return routing.get(estimated_complexity, "sonnet")
```
When spawning sub-agents:

- Use `model: "haiku"` for Explore agents doing simple file searches
- Use `model: "sonnet"` (the default) for most implementation work
- Use `model: "opus"` only for architecture reviews, complex debugging, and distinguished-engineer tasks
### Cost Savings Examples
| Scenario | Before (all Opus) | After (routed) | Savings |
|---|---|---|---|
| 10 file searches | $0.15 | $0.008 (haiku) | 95% |
| 5 code reviews | $0.75 | $0.20 (sonnet) | 73% |
| 1 architecture review | $0.15 | $0.15 (opus) | 0% |
| Typical session | $1.05 | $0.36 | 66% |
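The session totals above are just sums of the per-scenario figures. A quick check, with the before/after values hard-coded from the table:

```python
# (before_usd, after_usd) per scenario, from the savings table above.
scenarios = {
    "10 file searches":      (0.15, 0.008),
    "5 code reviews":        (0.75, 0.20),
    "1 architecture review": (0.15, 0.15),
}

before = sum(b for b, _ in scenarios.values())
after = sum(a for _, a in scenarios.values())
savings = 1 - after / before

print(f"Session: ${before:.2f} -> ${after:.2f} ({savings:.0%} saved)")
```

Note that the architecture review contributes no savings by design: it already runs on the cheapest model that can handle it.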
## Narrow Retry Strategy
When an LLM call fails, retry intelligently — not blindly.
### Principles
- Retry the minimum scope — Don't re-run the entire pipeline, retry only the failed step
- Escalate model on retry — If haiku fails, retry with sonnet, not haiku again
- Add context on retry — Include the error in the retry prompt
- Max 2 retries — If it fails 3 times, escalate to human
### Retry Escalation Ladder

```
Attempt 1: haiku (cheapest)
    | fail
    v
Attempt 2: sonnet + error context
    | fail
    v
Attempt 3: opus + error context + examples
    | fail
    v
Escalate to human
```
### Implementation

```python
RETRY_LADDER = [
    {"model": "haiku", "extra_context": None},
    {"model": "sonnet", "extra_context": "Previous attempt failed: {error}"},
    {"model": "opus", "extra_context": "Two attempts failed. Errors: {errors}. Please reason carefully."},
]

class EscalationNeeded(Exception):
    """Raised when every rung of the retry ladder has failed."""

async def call_with_retry(prompt: str, task: str) -> str:
    errors = []
    for attempt in RETRY_LADDER:
        # Only add error context when there is an error to report.
        extra = None
        if attempt["extra_context"] and errors:
            extra = attempt["extra_context"].format(
                error=errors[-1],
                errors="; ".join(errors),
            )
        try:
            return await call_llm(
                model=attempt["model"],
                prompt=prompt,
                extra_context=extra,
            )
        except Exception as e:
            errors.append(str(e))
    raise EscalationNeeded(f"Failed after {len(RETRY_LADDER)} attempts: {errors}")
```
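To see the ladder in action, here is a self-contained simulation. The stub `call_llm` and its failure pattern are invented for illustration: it pretends the cheaper models fail so the loop escalates to Opus.

```python
import asyncio

RETRY_LADDER = ["haiku", "sonnet", "opus"]

async def call_llm(model: str, prompt: str) -> str:
    # Stub: pretend the cheaper models cannot complete this task.
    if model != "opus":
        raise RuntimeError(f"{model} could not complete the task")
    return f"done by {model}"

async def demo() -> str:
    errors = []
    for model in RETRY_LADDER:
        try:
            return await call_llm(model, "refactor the parser")
        except Exception as e:
            errors.append(str(e))
    raise RuntimeError(f"escalate to human: {errors}")

result = asyncio.run(demo())
print(result)  # done by opus
```

Because each rung retries only the failed call, the earlier (cheap) failures cost a fraction of re-running the whole pipeline on Opus.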
## Prompt Caching Strategies
Structure prompts to maximize cache hits and reduce costs.
### Cache-Friendly Prompt Structure

```
+-------------------------------------+
| STATIC SYSTEM PROMPT (cached)       |  <- Rarely changes, high cache hit rate
|  - Agent instructions               |
|  - Tool definitions                 |
|  - Project context (CLAUDE.md)      |
+-------------------------------------+
| SEMI-STATIC CONTEXT (cached)        |  <- Changes per-task, not per-turn
|  - File contents being worked on    |
|  - Relevant documentation           |
|  - Test results                     |
+-------------------------------------+
| DYNAMIC CONTENT (not cached)        |  <- Changes every turn
|  - User message                     |
|  - Latest tool results              |
|  - Conversation tail                |
+-------------------------------------+
```
### Key Principles
- Front-load static content — Put rarely-changing content at the start of the prompt
- Batch file reads — Read all needed files at once, not one at a time
- Minimize prompt churn — Avoid reformatting the same content differently each turn
- Use system prompts for stable context — CLAUDE.md, agent definitions, skill bodies
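With the Anthropic Messages API, cache boundaries are marked explicitly with `cache_control` breakpoints. A sketch of the request shape (the model ID is a placeholder and the text contents are abbreviated):

```python
# Request-body sketch: each cache_control marks the end of a cacheable prefix.
request = {
    "model": "<model-id>",
    "system": [
        {
            "type": "text",
            "text": "<agent instructions + CLAUDE.md project context>",
            "cache_control": {"type": "ephemeral"},  # static prefix, cached
        },
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<file contents for this task>",
                    "cache_control": {"type": "ephemeral"},  # semi-static, cached
                },
                {"type": "text", "text": "Fix the failing test."},  # dynamic, not cached
            ],
        },
    ],
}
```

Everything before a breakpoint is eligible for caching; everything after it is billed at the normal rate, which is why the dynamic tail should stay as small as possible.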
### Cost Impact of Caching
With Anthropic's prompt caching:
- Cached tokens cost 90% less on input
- Cache write costs 25% more (one-time)
- Cache TTL: 5 minutes (refreshed on use)
Strategy: Keep system prompt + project context stable across turns. Only the user message and tool results should change.
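The 90%-cheaper reads and 25%-costlier writes imply caching pays for itself on the second turn. A quick break-even sketch, with the uncached input price normalized to 1.0:

```python
# Relative input-token prices under prompt caching (uncached = 1.0).
UNCACHED = 1.00     # normal input price
CACHE_WRITE = 1.25  # first turn pays a 25% write premium
CACHE_READ = 0.10   # later turns pay 90% less

def relative_cost(turns: int, cached: bool) -> float:
    """Relative cost of re-sending the same stable prefix for `turns` turns."""
    if not cached:
        return turns * UNCACHED
    return CACHE_WRITE + (turns - 1) * CACHE_READ

for turns in (1, 2, 5):
    print(turns, relative_cost(turns, cached=False), relative_cost(turns, cached=True))
```

Caching loses only in the degenerate single-turn case; by turn two it is already cheaper, and the gap widens every turn the 5-minute TTL is refreshed.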
## Budget Planning

### Estimating Session Costs
| Session Type | Typical Tokens | Estimated Cost |
|---|---|---|
| Quick fix (5 min) | 50K in / 10K out | $0.15 - $0.90 |
| Feature implementation (30 min) | 500K in / 100K out | $1.50 - $9.00 |
| Deep debugging (1 hr) | 1M in / 200K out | $3.00 - $18.00 |
| Multi-agent swarm (2 hr) | 5M in / 1M out | $15.00 - $90.00 |
### Setting Budgets

```bash
# Per-session cost cap (used by the agent-ops skill)
export AGENT_COST_CAP=5.00

# Per-agent cost cap for sub-agents
export SUBAGENT_COST_CAP=1.00

# Warning threshold (fraction of cap)
export AGENT_COST_CAP_WARN=0.80
```
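A sketch of how an orchestrator might enforce these variables; the enforcement logic and `check_budget` helper are assumptions, only the variable names and defaults come from the block above:

```python
import os

def check_budget(spent_usd: float) -> str:
    """Return 'ok', 'warn', or 'stop' based on the session cost cap."""
    cap = float(os.environ.get("AGENT_COST_CAP", "5.00"))
    warn_frac = float(os.environ.get("AGENT_COST_CAP_WARN", "0.80"))
    if spent_usd >= cap:
        return "stop"
    if spent_usd >= cap * warn_frac:
        return "warn"
    return "ok"

os.environ["AGENT_COST_CAP"] = "5.00"
os.environ["AGENT_COST_CAP_WARN"] = "0.80"
print(check_budget(1.00))  # ok
print(check_budget(4.50))  # warn
print(check_budget(5.00))  # stop
```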
### Monitoring

```bash
# View today's costs
grep "$(date +%Y-%m-%d)" $HOME/{{TOOL_DIR}}/metrics/costs.jsonl | \
  python3 -c "import sys,json; rows=[json.loads(l) for l in sys.stdin]; \
print(f'Sessions: {len(set(r[\"session_id\"] for r in rows))}'); \
print(f'Total: \${sum(r[\"estimated_cost_usd\"] for r in rows):.4f}')"
```