Agent Evaluation Methods
Agent evaluation requires different approaches than traditional software. Agents are non-deterministic, may take different valid paths, and lack single correct answers.
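Because the path an agent takes can legitimately vary between runs, one way to evaluate it is to grade only the final outcome and repeat the task several times. The following is a minimal sketch of that idea, not a prescribed harness: `RunResult`, `pass_rate`, `stub_agent`, and the `is_acceptable` callback are all hypothetical names introduced here for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunResult:
    """Outcome of one agent run (hypothetical structure)."""
    final_answer: str
    steps: List[str]  # the path taken; recorded but not graded


def pass_rate(
    run_agent: Callable[[str], RunResult],
    task: str,
    is_acceptable: Callable[[str], bool],
    trials: int = 5,
) -> float:
    """Run the agent `trials` times and return the fraction of acceptable outcomes."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = 0
    for _ in range(trials):
        result = run_agent(task)
        # Grade the outcome only; different step sequences are all allowed.
        if is_acceptable(result.final_answer):
            passed += 1
    return passed / trials


def stub_agent(task: str) -> RunResult:
    """Stand-in for a real agent: picks one of two valid paths at random."""
    path = random.choice([["search", "read"], ["read", "search", "verify"]])
    return RunResult(final_answer="Paris", steps=path)
```

A call such as `pass_rate(stub_agent, "capital of France?", lambda a: a == "Paris")` reports how often the outcome is acceptable, regardless of which path each run followed. Accepting a predicate rather than a single expected string leaves room for tasks with more than one correct answer.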
Key Finding: 95% Performance Drivers
Research on the BrowseComp benchmark found that three factors explain 95% of the variance in performance:
| Factor | Variance explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
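Implications: Upgrading the model tends to yield larger gains than raising the token budget. The finding also supports multi-agent architectures, which spread token usage across separate context windows.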