Agent Evaluation Methods
Evaluating agents requires different approaches than testing traditional software: agents are non-deterministic, may take different valid paths to the same goal, and often lack a single correct answer.
Key Finding: 95% Performance Drivers
An analysis of agent performance on the BrowseComp benchmark found that three factors explain 95% of the performance variance:
| Factor | Variance | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Implications: Upgrading the model yields larger gains than raising the token budget. The findings also support multi-agent architectures, which spread token usage across several agents.
Multi-Dimensional Rubric
| Dimension | Excellent | Good | Acceptable | Failed |
|---|---|---|---|---|
| Factual accuracy | All correct | Minor errors | Some errors | Wrong |
| Completeness | All aspects | Most aspects | Key aspects | Missing |
| Citation accuracy | All match | Most match | Some match | Wrong |
| Tool efficiency | Optimal | Good | Adequate | Wasteful |
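One way to make the rubric machine-scorable is to map each level to a number and combine the dimensions with weights. The sketch below assumes a hypothetical level-to-score mapping (`LEVEL_SCORES`) and hypothetical weights (`WEIGHTS`); neither comes from the source, and both would need tuning for a real task.

```python
# Hypothetical mapping from rubric levels to numeric scores (an assumption).
LEVEL_SCORES = {"excellent": 1.0, "good": 0.75, "acceptable": 0.5, "failed": 0.0}

# Hypothetical weights per dimension (an assumption); they sum to 1.0.
WEIGHTS = {
    "factual_accuracy": 0.4,
    "completeness": 0.3,
    "citation_accuracy": 0.2,
    "tool_efficiency": 0.1,
}


def rubric_score(levels: dict[str, str]) -> float:
    """Combine per-dimension rubric levels into one weighted score in [0, 1]."""
    missing = set(WEIGHTS) - set(levels)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * LEVEL_SCORES[levels[dim].lower()] for dim in WEIGHTS)


example = {
    "factual_accuracy": "Excellent",
    "completeness": "Good",
    "citation_accuracy": "Acceptable",
    "tool_efficiency": "Good",
}
overall = rubric_score(example)
```

Weighting factual accuracy most heavily is a design choice for research-style tasks; other tasks may call for a different balance.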
LLM-as-Judge
evaluation_prompt = """
Task: {task_description}
Agent Output: {agent_output}
Ground Truth: {ground_truth}
Evaluate on:
1. Factual accuracy (0-1)
2. Completeness (0-1)
3. Citation accuracy (0-1)
4. Tool efficiency (0-1)
Provide scores and reasoning.
"""
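A prompt like the one above still needs to be filled in and its reply parsed. The following is a minimal sketch, assuming the judge is asked to reply in JSON (an addition to the original prompt) and that `call_model` is a hypothetical callable wrapping whatever LLM client is in use. A stub stands in for the model so the sketch runs offline.

```python
import json
from typing import Callable

DIMENSIONS = ["factual_accuracy", "completeness", "citation_accuracy", "tool_efficiency"]

JUDGE_TEMPLATE = """
Task: {task_description}
Agent Output: {agent_output}
Ground Truth: {ground_truth}
Evaluate on:
1. Factual accuracy (0-1)
2. Completeness (0-1)
3. Citation accuracy (0-1)
4. Tool efficiency (0-1)
Reply with a JSON object with keys {keys} and "reasoning".
"""


def judge(task: str, output: str, truth: str, call_model: Callable[[str], str]) -> dict:
    """Fill the prompt, call the judge model, and validate the parsed scores."""
    prompt = JUDGE_TEMPLATE.format(
        task_description=task, agent_output=output,
        ground_truth=truth, keys=", ".join(DIMENSIONS),
    )
    reply = json.loads(call_model(prompt))
    scores = {}
    for dim in DIMENSIONS:
        value = float(reply[dim])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{dim} out of range: {value}")
        scores[dim] = value
    # Unweighted mean as the overall score (an assumption; weights may fit better).
    scores["overall"] = sum(scores.values()) / len(DIMENSIONS)
    scores["reasoning"] = reply.get("reasoning", "")
    return scores


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call (hypothetical), returning a fixed JSON reply.
    return json.dumps({"factual_accuracy": 1.0, "completeness": 0.5,
                       "citation_accuracy": 1.0, "tool_efficiency": 0.5,
                       "reasoning": "stubbed reply"})


result = judge("What is the capital of France?", "Paris", "Paris", fake_model)
```

Validating the score range before aggregating guards against a judge that drifts onto a different scale, such as 1 to 10.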
Test Set Design
test_set = [
    {"name": "simple", "complexity": "simple",
     "input": "What is the capital of France?"},
    {"name": "medium", "complexity": "medium",
     "input": "Compare Apple and Microsoft revenue"},
    {"name": "complex", "complexity": "complex",
     "input": "Analyze Q1-Q4 sales trends"},
    {"name": "very_complex", "complexity": "very_complex",
     "input": "Research AI tech, evaluate impact, recommend strategy"},
]
Evaluation Pipeline
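def evaluate_agent(agent, test_set):
    """Run the agent on each test case and score the output with an LLM judge."""
    results = []
    for test in test_set:
        output = agent.run(test["input"])
        # llm_judge is assumed to be defined elsewhere and to return a dict
        # of scores that includes an "overall" value in [0, 1].
        scores = llm_judge(output, test)
        results.append({
            "test": test["name"],
            "scores": scores,
            "passed": scores["overall"] >= 0.7,  # pass threshold
        })
    return results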
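To try the pipeline without a live model, the same loop can be run with stand-ins. The sketch below is illustrative only: `StubAgent` and `stub_judge` are hypothetical placeholders, and the judge and threshold are passed in as parameters so they can be swapped in tests.

```python
class StubAgent:
    """Hypothetical agent that returns a canned answer for every input."""

    def run(self, task_input: str) -> str:
        return f"answer to: {task_input}"


def stub_judge(output: str, test: dict) -> dict:
    # Stand-in for an LLM judge: scores simple tasks higher than the rest.
    overall = 0.9 if test["complexity"] == "simple" else 0.6
    return {"overall": overall}


def evaluate_agent(agent, test_set, judge, threshold=0.7):
    """Same loop as the pipeline above, with judge and threshold injected."""
    results = []
    for test in test_set:
        scores = judge(agent.run(test["input"]), test)
        results.append({"test": test["name"], "scores": scores,
                        "passed": scores["overall"] >= threshold})
    return results


tests = [
    {"name": "simple", "complexity": "simple",
     "input": "What is the capital of France?"},
    {"name": "medium", "complexity": "medium",
     "input": "Compare Apple and Microsoft revenue"},
]
results = evaluate_agent(StubAgent(), tests, stub_judge)
pass_rate = sum(r["passed"] for r in results) / len(results)
```

Injecting the judge keeps the loop testable and makes it easy to compare judges on the same test set.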
Complexity Stratification
| Level | Characteristics |
|---|---|
| Simple | Single tool call |
| Medium | Multiple tool calls |
| Complex | Many calls, ambiguity |
| Very Complex | Extended interaction, deep reasoning |
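Stratification pays off when results are reported per level rather than as one aggregate. A minimal sketch, assuming each result record carries the `complexity` of its test (the pipeline above does not record it, so that field is an assumption), with made-up sample data:

```python
from collections import defaultdict


def pass_rate_by_level(results: list[dict]) -> dict[str, float]:
    """Group results by complexity level and compute the pass rate for each."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r["complexity"]] += 1
        passes[r["complexity"]] += int(r["passed"])
    return {level: passes[level] / totals[level] for level in totals}


# Illustrative records only, not measured results.
sample_results = [
    {"complexity": "simple", "passed": True},
    {"complexity": "simple", "passed": True},
    {"complexity": "complex", "passed": True},
    {"complexity": "complex", "passed": False},
    {"complexity": "very_complex", "passed": False},
]
rates = pass_rate_by_level(sample_results)
```

A high overall pass rate can hide a collapse at the harder levels, which a per-level breakdown makes visible.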
Context Engineering Evaluation
Test context strategies systematically:
- Run agents with different context strategies on the same tests
- Compare quality scores, token usage, efficiency
- Identify degradation cliffs at different context sizes
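The comparison above can be sketched as a search for the first context size at which quality drops sharply. The strategy names, the scores, and the `max_drop` threshold below are all hypothetical and serve only to show the shape of the analysis.

```python
from typing import Optional


def find_cliff(scores_by_size: dict[int, float], max_drop: float = 0.15) -> Optional[int]:
    """Return the first context size where quality falls by more than max_drop
    relative to the next smaller size, or None if there is no such drop."""
    sizes = sorted(scores_by_size)
    for prev, curr in zip(sizes, sizes[1:]):
        if scores_by_size[prev] - scores_by_size[curr] > max_drop:
            return curr
    return None


# Illustrative numbers only, not measured results.
measurements = {
    "full_history": {8_000: 0.90, 32_000: 0.88, 64_000: 0.65, 128_000: 0.60},
    "summarized": {8_000: 0.88, 32_000: 0.87, 64_000: 0.84, 128_000: 0.80},
}
cliffs = {name: find_cliff(scores) for name, scores in measurements.items()}
```

Running every strategy on the same tests and the same context sizes keeps the comparison fair; token usage could be tracked alongside the scores in the same structure.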
Continuous Evaluation
- Run evaluations on all agent changes
- Track metrics over time
- Set alerts for quality drops
- Sample production interactions
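An alert for quality drops can be as simple as comparing the latest score with a rolling baseline. This is a minimal sketch; the `window` and `tolerance` values are assumptions to be tuned against how noisy the metric is.

```python
from statistics import mean


def quality_alert(history: list[float], latest: float,
                  window: int = 5, tolerance: float = 0.05) -> bool:
    """Return True when latest falls more than tolerance below the mean
    of the last `window` scores in history."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history[-window:])
    return baseline - latest > tolerance


# Illustrative score history, not measured results.
history = [0.82, 0.84, 0.83, 0.85, 0.84]
alert_on_drop = quality_alert(history, latest=0.75)
alert_on_noise = quality_alert(history, latest=0.82)
```

Comparing against a rolling mean rather than the single previous run reduces false alarms from the run-to-run variation that non-deterministic agents produce.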
Avoiding Pitfalls
| Pitfall | Solution |
|---|---|
| Path overfitting | Evaluate outcomes, not steps |
| Ignoring edge cases | Include diverse scenarios |
| Single metric | Multi-dimensional rubrics |
| Ignoring context | Test realistic context sizes |
| No human review | Supplement automated evaluation with human review |
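To illustrate the first row, the sketch below checks only the final answer, so two runs that take different paths both pass. The `required_facts` substring check is a deliberately crude, hypothetical stand-in for a real outcome check such as an LLM judge or a state assertion.

```python
def outcome_passed(final_answer: str, required_facts: list[str]) -> bool:
    """Pass when every required fact appears in the final answer,
    regardless of which tools were called or in what order."""
    text = final_answer.lower()
    return all(fact.lower() in text for fact in required_facts)


# Two hypothetical runs that reach the same outcome by different paths.
run_a = {"steps": ["search", "read", "answer"],
         "answer": "The capital of France is Paris."}
run_b = {"steps": ["lookup", "answer"],
         "answer": "Paris is France's capital."}

verdicts = [outcome_passed(run["answer"], ["Paris"]) for run in (run_a, run_b)]
```

A check that compared `steps` against a fixed expected sequence would fail one of these runs even though both answers are correct.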
Best Practices
- Use multi-dimensional rubrics
- Evaluate outcomes, not specific paths
- Cover complexity levels
- Test with realistic context sizes
- Run evaluations continuously
- Supplement LLM judging with human review
- Track metrics for trends
- Set clear pass/fail thresholds
Repository
eyadsibai/ltk
First Seen
Jan 28, 2026