agent-evaluation
Evaluation Methods for Claude Code Agents
Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.
Core Concepts
Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.
The key insight is that agents may find alternative paths to goals—the evaluation should judge whether they achieve right outcomes while following reasonable processes.
Performance Drivers: The 95% Finding Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:
| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Implications for Claude Code development:
- Token budgets matter: Evaluate with realistic token constraints
- Model upgrades beat token increases: Upgrading models provides larger gains than increasing token budgets
- Multi-agent validation: Validates architectures that distribute work across subagents with separate context windows
Progressive Loading
L2 Content (loaded when methodology details needed):
- See: references/methodologies.md
- Evaluation Challenges
- Evaluation Rubric Design
- Evaluation Methodologies
- Test Set Design
- Context Engineering Evaluation
L3 Content (loaded when advanced techniques or examples needed):
- See: references/advanced.md
- Advanced Evaluation: LLM-as-Judge
- Evaluation Metrics Reference
- Bias Mitigation Techniques
- LLM-as-Judge Implementation Patterns
- Metric Selection Guide
- Practical Examples
More from zpankz/mcp-skillset
network-meta-analysis-appraisal
Systematically appraise network meta-analysis papers using integrated 200-point checklist (PRISMA-NMA, NICE DSU TSD 7, ISPOR-AMCP-NPC, CINeMA) with triple-validation methodology, automated PDF extraction, semantic evidence matching, and concordance analysis. Use when evaluating NMA quality for peer review, guideline development, HTA, or reimbursement decisions.
16software-architecture
Guide for quality focused software architecture. This skill should be used when users want to write code, design architecture, analyze code, in any case that relates to software development.
13cursor-skills
Cursor is an AI-powered code editor and development environment that combines intelligent coding assistance with enterprise-grade features and workflow automation. It extends beyond basic AI code comp...
13textbook-grounding
Orthogonally-integrated Hegelian syntopical analysis for SAQ/VIVA/concept grounding with systematic textbook citations. Implements thesis extraction → antithesis identification → abductive synthesis across multiple authoritative sources. Tensor-integrated with /m command: activates S×T×L synergies (textbook-grounding × pdf-search × qmd = 0.95). Triggers on requests for model SAQ responses, VIVA preparation, concept explanations requiring textbook evidence, or any PEX exam content needing systematic cross-reference validation.
12obsidian-process
This skill should be used when batch processing Obsidian markdown vaults. Handles wikilink extraction, tag normalization, frontmatter CRUD operations, and vault analysis. Use for vault-wide transformations, link auditing, tag standardization, metadata management, and migration workflows. Integrates with obsidian-markdown for syntax validation and obsidian-data-importer for structured imports.
12terminal-ui-design
Create distinctive, production-grade terminal user interfaces with high design quality. Use this skill when the user asks to build CLI tools, TUI applications, or terminal-based interfaces. Generates creative, polished code that avoids generic terminal aesthetics.
10