agent-evaluation

Agent Evaluation Methods

Agent evaluation requires different approaches than traditional software testing. Agents are non-deterministic, may take different valid paths to the same goal, and often have no single correct answer.
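As a minimal sketch of what this means in practice, the example below grades only the final answer, accepts more than one valid answer, and repeats the task several times to get a pass rate rather than a single pass/fail. The names `pass_rate` and `make_fake_agent` are hypothetical, and the fake agent is a stand-in for a real agent call; none of this comes from the skill itself.

```python
from itertools import cycle
from typing import Callable, Iterable


def pass_rate(
    run_agent: Callable[[str], str],
    task: str,
    acceptable: Iterable[str],
    trials: int = 5,
) -> float:
    """Run the agent `trials` times and return the fraction of accepted answers.

    Only the final answer is graded, so different valid paths are not penalized.
    """
    accepted = {a.strip().lower() for a in acceptable}
    passes = 0
    for _ in range(trials):
        answer = run_agent(task).strip().lower()
        if answer in accepted:
            passes += 1
    return passes / trials


def make_fake_agent(answers):
    """Hypothetical stand-in for a non-deterministic agent: cycles through answers."""
    source = cycle(answers)
    return lambda task: next(source)


fake_agent = make_fake_agent(["Paris", "paris ", "Lyon", "Paris, France"])
rate = pass_rate(
    fake_agent,
    "What is the capital of France?",
    acceptable=["Paris", "Paris, France"],
    trials=4,
)
print(rate)
```

Exact string matching is the simplest possible grader; tasks with open-ended answers would need a looser check, such as a rubric or a model-based judge.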

Key Finding: 95% Performance Drivers

Research on the BrowseComp benchmark found that three factors explain 95% of the variance in performance:

| Factor | Variance explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |

Implications: A model upgrade can yield a larger gain than a bigger token budget. The findings also support multi-agent architectures, which spread token usage across several agents.
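One way to act on these factors is to log them for every evaluation run, so that score differences can be read alongside token usage, tool calls, and model choice. The sketch below assumes a hypothetical `RunRecord` structure and `summarize` helper; the records are made-up placeholders, not BrowseComp results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    model: str        # model identifier used for the run
    tokens: int       # total tokens consumed by the run
    tool_calls: int   # number of tool invocations
    score: float      # 0.0 to 1.0, from whatever grader is in use


def summarize(records):
    """Group runs by model and average the three tracked factors plus score."""
    by_model = {}
    for record in records:
        by_model.setdefault(record.model, []).append(record)
    return {
        model: {
            "mean_score": mean(r.score for r in runs),
            "mean_tokens": mean(r.tokens for r in runs),
            "mean_tool_calls": mean(r.tool_calls for r in runs),
        }
        for model, runs in by_model.items()
    }


# Placeholder data for illustration only.
records = [
    RunRecord("model-a", 10_000, 4, 0.5),
    RunRecord("model-a", 20_000, 8, 0.7),
    RunRecord("model-b", 10_000, 5, 0.8),
]
summary = summarize(records)
for model, stats in summary.items():
    print(model, stats)
```

Keeping token counts next to scores makes it possible to compare models at similar token budgets instead of attributing every gain to the model alone.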

Multi-Dimensional Rubric

Repository
eyadsibai/ltk
First Seen
Jan 28, 2026