Cost Verification Auditor

Verify that token cost estimates are within ±20% of actual Claude API usage.

When to Use

✅ Use for:

Validating token estimation systems after implementation
Pre-deployment cost accuracy checks
Debugging unexpected API bills
Periodic estimation drift detection

❌ NOT for:

Looking up model pricing (use pricing docs)
Budget planning or forecasting
Cost optimization strategies
Comparing models by price

Core Audit Process

Decision Tree

Has estimator? ──No──→ Build estimator first (see Calibration Guidelines)
      │
     Yes
      ↓
Define 3+ test cases (simple/medium/complex)
      ↓
Estimate BEFORE execution (no peeking!)
      ↓
Execute against real API
      ↓
Calculate variance: (actual - estimated) / estimated
      ↓
Variance ≤ ±20%? ──Yes──→ PASS ✓
      │
     No
      ↓
Apply fixes from Anti-Patterns section
      ↓
Re-run verification

Variance Formula

const inputVariance = (actual.inputTokens - estimate.inputTokens) / estimate.inputTokens;
const outputVariance = (actual.outputTokens - estimate.outputTokens) / estimate.outputTokens;
const costVariance = (actual.totalCost - estimate.totalCost) / estimate.totalCost;

// PASS if both input AND output within ±20%
const passed = Math.abs(inputVariance) <= 0.20 && Math.abs(outputVariance) <= 0.20;

Common Anti-Patterns

Anti-Pattern: The 500-Token Overhead Myth

Novice thinking: "Claude Code adds ~500 tokens overhead, so add that to every estimate."

Reality: Direct API calls have ~10 token overhead. The 500+ overhead is ONLY when using Claude Code's full context (system prompts, tools, conversation history).

Timeline:

Pre-2025: Many tutorials used 500+ token estimates
2025+: Direct API overhead is minimal (~10 tokens)

What to use instead:

Context	Overhead
Direct API call	~10 tokens
With system prompt	50-200 tokens
With tools/functions	100-500 tokens
Claude Code full context	500-2000 tokens

How to detect: Consistent 40-90% overestimation = overhead too high.

Anti-Pattern: Per-Node Accuracy Obsession

Novice thinking: "Every node must be within ±20% or the estimator is broken."

Reality: LLM output length is non-deterministic. Per-node output variance of 30-50% is normal. What matters is aggregate cost accuracy.

What to use instead:

Focus on total DAG cost variance (should be ±20%)
Accept per-node output variance up to ±40%
Use constrained prompts ("list exactly 3") to reduce variance

How to detect: Input estimates accurate, output varies wildly = normal LLM behavior.

Anti-Pattern: Peeking Before Estimating

Novice thinking: "Let me run the API call first to see what tokens we get, then build the estimator."

Reality: This produces perfectly-fitted estimates that fail on new prompts. Estimation must happen BEFORE execution.

Correct approach:

Estimate based on prompt length and heuristics
Execute API call
Compare variance
Adjust heuristics if needed

Calibration Guidelines

Input Token Estimation

// Calibrated 2026-01-30
const inputTokens = Math.ceil(prompt.length / CHARS_PER_TOKEN) + OVERHEAD;

Text Type	CHARS_PER_TOKEN	Notes
English prose	4.0	Most consistent
Code	3.0-3.5	Symbols tokenize differently
Mixed	3.5	Balanced (recommended default)
JSON/structured	3.0	Punctuation heavy

Output Token Estimation

Prompt Constraint	Multiplier	Notes
"List exactly N items"	0.8x input	Highly constrained
"Brief summary"	1.0x input	Moderate
"Explain in detail"	2-3x input	Expansive
Unconstrained	1.5x input	Variable

Always: Minimum 100 output tokens for any meaningful response.

Model Behavior

Model	Output Tendency
Claude Opus	Longer, more detailed
Claude Sonnet	Balanced
Claude Haiku	Concise, efficient

Quick Fixes

Symptom	Cause	Fix
Overestimating by 40%+	Overhead too high	Reduce from 500 → 10
Underestimating inputs	Chars/token too high	Reduce from 4.0 → 3.5
Output wildly varies	LLM non-determinism	Use constrained prompts
Total cost accurate but per-node off	Normal aggregation	Accept it, focus on totals

Verification Checklist

3+ test cases (simple, medium, complex)
Estimates run BEFORE API calls
Variance formula: (actual - estimated) / estimated
Target: ±20% for input AND output
Report includes actionable recommendations

References

See /references/calibration-data.md for detailed calibration tables and historical data.

cost-verification-auditor