agent-comparison
Agent Comparison Skill
Compare agent variants through controlled A/B benchmarks. Runs identical tasks on both agents, grades output quality with domain-specific checklists, and reports total session token cost to a working solution. This skill is exclusively for agent variant comparison — use agent-evaluation for single-agent assessment, and skill-eval for skill testing.
Reference Loading Table
| Signal | Load These Files | Why |
|---|---|---|
| selecting or writing benchmark tasks | benchmark-tasks.md | Standard benchmark task descriptions and prompts. |
| example-driven tasks, errors | examples-and-errors.md | Error handling for common benchmark failures. |
| grading or scoring solutions | grading-rubric.md | Detailed grading criteria and quality checklists. |
| methodology or benchmark-data questions | methodology.md | Complete testing methodology with December 2024 data. |
| optimization guidance | optimization-guide.md | Detailed optimization guidance. |
| Phase 5 OPTIMIZE, autoresearch runs | optimize-phase.md | Full Phase 5 OPTIMIZE procedure, CLI flags, and search modes. |
| writing the comparison report | report-template.md | Comparison report template with all required sections. |
Instructions
See references/examples-and-errors.md for error handling. See references/optimize-phase.md for the full Phase 5 OPTIMIZE procedure. See references/methodology.md for December 2024 benchmark data.
Phase 1: PREPARE
Goal: Create benchmark environment and validate both agent variants exist.
Read and follow the repository CLAUDE.md before starting any execution.
Step 1: Analyze original agent
wc -l agents/{original-agent}.md          # total size in lines
grep "^## " agents/{original-agent}.md    # section headings (structure overview)
grep -c '```' agents/{original-agent}.md  # fence markers, roughly 2x the number of code examples
Step 2: Create or validate compact variant
If creating a compact variant, preserve:
- YAML frontmatter (name, description, routing)
- Core patterns and principles
- Error handling philosophy
Remove or condense:
- Lengthy code examples (keep 1-2 representative per pattern)
- Verbose explanations (condense to bullet points)
- Redundant instructions and changelogs
Target 10-15% of original size while keeping essential knowledge. Remove redundancy, not capability — stripping error handling patterns or concurrency guidance creates an unfair comparison because the compact agent is missing essential knowledge rather than expressing it concisely.
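A minimal sketch of the frontmatter the compact variant should keep, with a placeholder description (keep whatever routing fields the original declares); the body then carries the core patterns as condensed bullets with 1-2 representative examples each:
---
name: {compact-agent}
description: Condensed variant of {original-agent}; same routing triggers at roughly 10-15% of the size.
---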
Step 3: Validate compact variant structure
head -20 agents/{compact-agent}.md | grep -E "^(name|description):"   # both frontmatter fields must appear
echo "Original: $(wc -l < agents/{original-agent}.md) lines"
echo "Compact: $(wc -l < agents/{compact-agent}.md) lines"
Step 4: Create benchmark directory and prepare prompts
mkdir -p benchmark/{task-name}/{full,compact}
Write the task prompt ONCE, then copy it for both agents. Both agents must receive the exact same task description, character-for-character, because different requirements produce different solutions and invalidate all measurements.
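A minimal sketch of that flow, assuming a prompt.md file name (any name works as long as both copies come from the same source file):
# Write the prompt once, then copy it verbatim
cat > benchmark/{task-name}/prompt.md <<'EOF'
[exact task prompt]
EOF
cp benchmark/{task-name}/prompt.md benchmark/{task-name}/full/prompt.md
cp benchmark/{task-name}/prompt.md benchmark/{task-name}/compact/prompt.md
diff benchmark/{task-name}/full/prompt.md benchmark/{task-name}/compact/prompt.md   # no output means identical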
Keep benchmark scripts simple — no speculative features or configurable frameworks that were not requested.
Gate: Both agent variants exist with valid YAML frontmatter. Benchmark directories created. Identical task prompts written. Proceed only when gate passes.
Phase 2: BENCHMARK
Goal: Run identical tasks on both agents, capturing all metrics.
Step 1: Run simple task benchmark (2-3 tasks)
Use algorithmic problems with clear specifications (e.g., Advent of Code Day 1-6). Simple tasks establish a baseline — if an agent fails here, it has fundamental issues. Running multiple simple tasks is necessary because a single data point is sensitive to task selection bias and cannot distinguish luck from systematic quality.
Spawn both agents in parallel using Task tool:
Task(
prompt="[exact task prompt]\nSave to: benchmark/{task}/full/",
subagent_type="{full-agent}"
)
Task(
prompt="[exact task prompt]\nSave to: benchmark/{task}/compact/",
subagent_type="{compact-agent}"
)
Run in parallel to avoid caching effects or system load variance skewing results.
Step 2: Run complex task benchmark (1-2 tasks)
Use production-style problems that require concurrency, error handling, edge case anticipation — these are where quality differences emerge because simple tasks mask differences in edge case handling. See references/benchmark-tasks.md for standard tasks.
Recommended complex tasks:
- Worker Pool: Rate limiting, graceful shutdown, panic recovery
- LRU Cache with TTL: Generics, background goroutines, zero-value semantics
- HTTP Service: Middleware chains, structured errors, health checks
Step 3: Capture metrics for each run
Record immediately after each agent completes — delayed recording loses precision. Track input/output token counts per turn where visible, since total session cost (not just prompt size) is what matters.
| Metric | Full Agent | Compact Agent |
|---|---|---|
| Tests pass | X/X | X/X |
| Race conditions | X | X |
| Code lines (main) | X | X |
| Test lines | X | X |
| Session tokens | X | X |
| Wall-clock time | Xm Xs | Xm Xs |
| Retry cycles | X | X |
Step 4: Run tests with race detector
# Run each in a subshell so both commands start from the repo root
(cd benchmark/{task-name}/full && go test -race -v -count=1)
(cd benchmark/{task-name}/compact && go test -race -v -count=1)
Use -count=1 to disable test caching. All generated code must pass the same test suite with the -race flag because race conditions are automatic quality failures.
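To satisfy the "test output saved" gate below, one option is to tee each run into an assumed file name inside the task directory:
(cd benchmark/{task-name}/full && go test -race -v -count=1 2>&1 | tee test-output.txt)
(cd benchmark/{task-name}/compact && go test -race -v -count=1 2>&1 | tee test-output.txt)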
Gate: Both agents completed all tasks. Metrics captured for every run. Test output saved. Proceed only when gate passes.
Phase 3: GRADE
Goal: Score code quality beyond pass/fail using domain-specific checklists.
Step 1: Create quality checklist BEFORE reviewing code
Define criteria before seeing results to prevent bias — inventing criteria after seeing one agent's output skews the comparison. See references/grading-rubric.md for standard rubrics.
| Criterion | 5/5 | 3/5 | 1/5 |
|---|---|---|---|
| Correctness | All tests pass, no race conditions | Some failures | Broken |
| Error Handling | Comprehensive, production-ready | Adequate | None |
| Idioms | Exemplary for the language | Acceptable | Anti-patterns |
| Documentation | Thorough | Adequate | None |
| Testing | Comprehensive coverage | Basic | Minimal |
Step 2: Score each solution independently
Grade each agent's code on all five criteria. Score one agent completely before starting the other. Report facts and show command output rather than describing it — every claim must be backed by measurable data (tokens, test counts, quality scores).
## {Agent} Solution - {Task}
| Criterion | Score | Notes |
|-----------|-------|-------|
| Correctness | X/5 | |
| Error Handling | X/5 | |
| Idioms | X/5 | |
| Documentation | X/5 | |
| Testing | X/5 | |
| **Total** | **X/25** | |
Step 3: Document specific bugs with production impact
For each bug found, record:
### Bug: {description}
- Agent: {which agent}
- What happened: {behavior}
- Correct behavior: {expected}
- Production impact: {consequence}
- Test coverage: {did tests catch it? why not?}
"Tests pass" is necessary but not sufficient — production bugs often pass tests. Apply the domain-specific quality checklist rather than relying only on test pass rates, because tests can miss goroutine leaks, wrong semantics, and other production issues.
Step 4: Calculate effective cost
effective_cost = total_tokens * (1 + bug_count * 0.25)
An agent using 194k tokens with 0 bugs has better economics than one using 119k tokens with 5 bugs requiring fixes. The metric that matters is total cost to working, production-quality solution — not prompt size, because prompt is a one-time cost while reasoning tokens dominate sessions. Check quality scores before claiming token savings, since savings that come from cutting corners are not real savings.
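Worked example with the figures above: 194,000 * (1 + 0 * 0.25) = 194,000 effective tokens for the clean solution, versus 119,000 * (1 + 5 * 0.25) = 267,750 for the buggier one, so the nominally cheaper run costs roughly 38% more once the bug penalty is applied.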
Gate: Both solutions graded with evidence. Specific bugs documented with production impact. Effective cost calculated. Proceed only when gate passes.
Phase 4: REPORT
Goal: Generate comparison report with evidence-backed verdict.
Step 1: Generate comparison report
Use the report template from references/report-template.md. Include:
- Executive summary with clear winner per metric
- Per-task results with metrics tables
- Token economics analysis (one-time prompt cost vs session cost)
- Specific bugs found and their production impact
- Verdict based on total evidence
Step 2: Run comparison analysis
python3 ${CLAUDE_SKILL_DIR}/scripts/compare.py benchmark/{task-name}/
Step 3: Analyze token economics
The key economic insight: agent prompts are a one-time cost per session. Everything after — reasoning, code generation, debugging, retries — costs tokens on every turn. When a compact agent produces correct code, it uses approximately the same total tokens as the full agent. The savings appear only when it cuts corners.
| Pattern | Description |
|---|---|
| Large agent, low churn | High initial cost, fewer retries, less debugging |
| Small agent, high churn | Low initial cost, more retries, more debugging |
Our data showed a 57-line agent used 69.5k tokens vs 69.6k for a 3,529-line agent on the same correct solution — prompt size alone does not determine cost.
Step 4: State verdict with evidence
The verdict must be backed by data. Include:
- Which agent won on simple tasks (expected: equivalent)
- Which agent won on complex tasks (expected: full agent)
- Total session cost comparison
- Effective cost comparison (with bug penalty)
- Clear recommendation for when to use each variant
See references/methodology.md for the complete testing methodology with December 2024 data.
Step 5: Clean up
Remove temporary benchmark files and debug outputs. Keep only the comparison report and generated code.
Gate: Report generated with all metrics. Verdict stated with evidence. Report saved to benchmark directory.
Phase 5: OPTIMIZE (optional — invoked explicitly)
Goal: Run an automated optimization loop that improves a markdown target's frontmatter description using trigger-rate eval tasks, then selects the best measured variants through beam search or single-path search.
Invoke when the user says "optimize this skill", "optimize the description", or "run autoresearch". The existing manual A/B comparison (Phases 1-4) remains the path for full agent benchmarking.
See references/optimize-phase.md for the full 9-step procedure, all CLI flags, recommended modes, live eval defaults, current reality check, and optional extensions.
Gate: Optimization complete. Results reviewed. Cherry-picked improvements applied and verified against full task set. Results recorded.
References
- ${CLAUDE_SKILL_DIR}/references/methodology.md: Complete testing methodology with December 2024 data
- ${CLAUDE_SKILL_DIR}/references/grading-rubric.md: Detailed grading criteria and quality checklists
- ${CLAUDE_SKILL_DIR}/references/benchmark-tasks.md: Standard benchmark task descriptions and prompts
- ${CLAUDE_SKILL_DIR}/references/report-template.md: Comparison report template with all required sections
- ${CLAUDE_SKILL_DIR}/references/optimize-phase.md: Full Phase 5 OPTIMIZE procedure (autoresearch loop, CLI flags, beam search, reality check)
- ${CLAUDE_SKILL_DIR}/references/examples-and-errors.md: Error handling for common benchmark failures