# Eval Harness
## Overview

A systematic framework for evaluating agent performance. It measures accuracy, efficiency, and reliability across defined test scenarios, enabling data-driven decisions about agent quality and improvement.
## When to Use
- Before deploying agent changes to production
- Comparing different agent configurations
- Identifying weaknesses in agent behavior
- Tracking agent quality over time
- Validating prompt improvements
## Evaluation Dimensions
### 1. Accuracy
Does the agent produce correct outputs?
| Metric | Measurement | Target |
|---|---|---|
| Task completion | % of tasks completed correctly | > 90% |
| Code correctness | % of generated code that compiles and passes tests | > 85% |
| Instruction following | % of instructions followed exactly | > 95% |
### 2. Efficiency
Does the agent use resources well?
| Metric | Measurement | Target |
|---|---|---|
| Token usage | Tokens consumed per task | Minimize |
| Tool calls | Number of tool calls per task | Minimize |
| Time to completion | Wall clock time per task | Minimize |
### 3. Reliability

Does the agent behave consistently? (A computation sketch follows the table.)
| Metric | Measurement | Target |
|---|---|---|
| Reproducibility | % of repeated runs producing identical output for the same input | > 95% |
| Error rate | % of runs that fail | < 5% |
| Recovery rate | % of errors recovered from | > 80% |
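
These reliability numbers are the least obvious to compute. Below is a minimal sketch, assuming each run is logged as a dict with `errored` and `recovered` flags, and that repeated runs of one input are collected as a list of output strings (these field names are illustrative, not a fixed schema):

```python
def reproducibility(outputs: list[str]) -> float:
    """Fraction of repeated runs that match the modal output (target: > 95%)."""
    modal = max(set(outputs), key=outputs.count)
    return outputs.count(modal) / len(outputs)


def error_rate(runs: list[dict]) -> float:
    """Fraction of runs that failed outright (target: < 5%)."""
    return sum(r["errored"] for r in runs) / len(runs)


def recovery_rate(runs: list[dict]) -> float:
    """Of the runs that hit an error, the fraction that still completed (target: > 80%)."""
    errored = [r for r in runs if r["errored"]]
    return sum(r["recovered"] for r in errored) / len(errored) if errored else 1.0
```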
## Evaluation Process
### Step 1: Define Test Cases

Create test cases that cover (see the sketch after this list):
- Happy path (expected behavior)
- Edge cases (unusual inputs)
- Error cases (invalid inputs)
- Ambiguous cases (unclear requirements)
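
One way to represent these cases is a small dataclass with an explicit category tag, so per-category pass rates fall out naturally in Step 3. This is a sketch under assumed names (`TestCase`, `Category`), not a specific library's API:

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    HAPPY_PATH = "happy_path"
    EDGE_CASE = "edge_case"
    ERROR_CASE = "error_case"
    AMBIGUOUS = "ambiguous"


@dataclass(frozen=True)
class TestCase:
    name: str
    category: Category
    input: str      # the prompt or task handed to the agent
    expected: str   # reference output or acceptance criterion


CASES = [
    TestCase("sort_basic", Category.HAPPY_PATH, "Sort [3, 1, 2] ascending", "[1, 2, 3]"),
    TestCase("sort_empty", Category.EDGE_CASE, "Sort [] ascending", "[]"),
    TestCase("sort_bad_input", Category.ERROR_CASE, "Sort 'abc' ascending", "error: input must be a list"),
]
```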
### Step 2: Run Evaluation

Execute each test case and record (see the sketch after this list):
- Input provided
- Expected output
- Actual output
- Pass/fail
- Token usage
- Time taken
- Tool calls made
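
A minimal run-loop sketch. It assumes an `agent` callable that returns its output plus token and tool-call counters; that interface and the exact-match grader are stand-ins you would replace with your own:

```python
import time


def run_eval(agent, cases):
    """Execute each test case and record the fields listed above."""
    results = []
    for case in cases:
        start = time.monotonic()
        output, tokens, tool_calls = agent(case.input)  # assumed interface
        results.append({
            "name": case.name,
            "category": case.category,
            "input": case.input,
            "expected": case.expected,
            "actual": output,
            # Exact match is the simplest grader; swap in tests or a rubric judge as needed.
            "passed": output.strip() == case.expected.strip(),
            "tokens": tokens,
            "seconds": time.monotonic() - start,
            "tool_calls": tool_calls,
        })
    return results
```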
### Step 3: Analyze Results

Calculate metrics (see the sketch after this list):
- Overall pass rate
- Per-category pass rates
- Token efficiency
- Error patterns
- Improvement areas
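
Continuing the sketch, aggregation over the Step 2 records takes only a few lines (field names follow the illustrative run loop above):

```python
from collections import defaultdict


def analyze(results):
    """Overall pass rate, per-category pass rates, and token efficiency."""
    pass_rate = sum(r["passed"] for r in results) / len(results)
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["passed"])
    per_category = {cat: sum(p) / len(p) for cat, p in by_category.items()}
    avg_tokens = sum(r["tokens"] for r in results) / len(results)
    return {"pass_rate": pass_rate, "per_category": per_category, "avg_tokens": avg_tokens}
```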
### Step 4: Report Findings

Summarize the results using a template like this one; a rendering sketch follows it.

```markdown
## Eval Results

### Summary
- Test cases: N
- Pass rate: X%
- Average tokens: N
- Average time: Xs

### By Category
| Category | Pass Rate | Issues |
|----------|-----------|--------|
| Accuracy | X% | [list] |
| Efficiency | X% | [list] |
| Reliability | X% | [list] |

### Recommendations
1. [Specific improvement]
2. [Specific improvement]
```
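The template can be filled mechanically from the Step 3 summary. A sketch, assuming the `Category` enum and the `analyze` output shape from the earlier sketches:

```python
def render_report(summary, n_cases, avg_seconds):
    """Render the summary portion of the report template as Markdown."""
    lines = [
        "## Eval Results",
        "",
        "### Summary",
        f"- Test cases: {n_cases}",
        f"- Pass rate: {summary['pass_rate']:.0%}",
        f"- Average tokens: {summary['avg_tokens']:.0f}",
        f"- Average time: {avg_seconds:.1f}s",
        "",
        "### By Category",
        "| Category | Pass Rate |",
        "|----------|-----------|",
    ]
    lines += [f"| {cat.value} | {rate:.0%} |" for cat, rate in summary["per_category"].items()]
    return "\n".join(lines)
```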
## Benchmark Scenarios

### Code Generation

- **Input:** "Write a function that sorts an array using quicksort"
- **Expected:** Correct implementation with proper edge cases
- **Metrics:** Compiles, passes tests, follows conventions (see the sketch below)
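
A sketch of the "compiles, passes tests" check: write the generated code plus a small test to a temp file and run it in a fresh interpreter (the 30-second timeout and Python-only focus are assumptions):

```python
import subprocess
import sys
import tempfile


def passes_smoke_test(generated_code: str, test_code: str) -> bool:
    """Run generated code plus assertions in a subprocess; pass = exit code 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```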
### Code Review

- **Input:** [Code with intentional bugs]
- **Expected:** Identifies all bugs with severity labels
- **Metrics:** Recall (fraction of seeded bugs found) and precision (fraction of reported issues that are real bugs); see the sketch below
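
Because the bugs are seeded deliberately, recall and precision reduce to set arithmetic over bug identifiers. A sketch; matching the agent's free-text findings to bug IDs is the hard part and is assumed done upstream:

```python
def recall_precision(seeded: set[str], reported: set[str]) -> tuple[float, float]:
    """seeded: IDs of bugs planted in the code; reported: IDs the agent flagged."""
    hits = seeded & reported
    recall = len(hits) / len(seeded) if seeded else 1.0         # share of bugs found
    precision = len(hits) / len(reported) if reported else 1.0  # share of flags that were real
    return recall, precision
```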
### Debugging

- **Input:** [Error message + relevant code]
- **Expected:** Correct root cause and fix
- **Metrics:** Accuracy of diagnosis, correctness of fix
## Anti-Rationalization Table
| Excuse | Counter |
|---|---|
| "I can tell it's working by looking" | Human judgment is biased and inconsistent. Metrics provide objective measurement. |
| "Evals take too much time" | Automated evals run in parallel. The time investment prevents production failures. |
| "One test case is enough" | Single tests don't reveal patterns. Multiple cases across categories provide confidence. |
| "The agent passed last time" | Regression happens. Regular evals catch quality drift before it reaches production. |