Eval Harness

Overview

A systematic framework for evaluating agent performance. Measures accuracy, efficiency, and reliability across defined test scenarios. Enables data-driven decisions about agent quality and improvement.

When to Use

  • Before deploying agent changes to production
  • Comparing different agent configurations
  • Identifying weaknesses in agent behavior
  • Tracking agent quality over time
  • Validating prompt improvements

Evaluation Dimensions

1. Accuracy

Does the agent produce correct outputs?

| Metric | Measurement | Target |
|--------|-------------|--------|
| Task completion | % of tasks completed correctly | > 90% |
| Code correctness | % of generated code that compiles and passes tests | > 85% |
| Instruction following | % of instructions followed exactly | > 95% |

2. Efficiency

Does the agent use resources well?

| Metric | Measurement | Target |
|--------|-------------|--------|
| Token usage | Tokens consumed per task | Minimize |
| Tool calls | Number of tool calls per task | Minimize |
| Time to completion | Wall clock time per task | Minimize |

3. Reliability

Does the agent behave consistently?

| Metric | Measurement | Target |
|--------|-------------|--------|
| Reproducibility | Same input → same output | > 95% |
| Error rate | % of runs that fail | < 5% |
| Recovery rate | % of errors recovered from | > 80% |
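
Reproducibility is the easiest of these to automate. The sketch below assumes a hypothetical `run_agent(prompt)` callable that returns the agent's final output as a string; it replays one input several times and reports how often the dominant output recurs.

```python
from collections import Counter

def reproducibility(run_agent, prompt, runs=10):
    """Replay the same prompt several times and return the share of runs
    that produced the most common output. `run_agent` is a placeholder
    for whatever invokes your agent and returns its final output."""
    outputs = [run_agent(prompt) for _ in range(runs)]
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / runs
```

A score of 1.0 means every run produced identical output; the > 95% target above corresponds to 0.95.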

Evaluation Process

Step 1: Define Test Cases

Create test cases that cover:

  • Happy path (expected behavior)
  • Edge cases (unusual inputs)
  • Error cases (invalid inputs)
  • Ambiguous cases (unclear requirements)
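
One way to make these categories concrete is a small test-case record. This is a minimal sketch, not a prescribed schema; the field names (`id`, `category`, `checker`, and so on) are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TestCase:
    id: str
    category: str      # "happy_path", "edge_case", "error_case", or "ambiguous"
    input: str         # prompt or task description given to the agent
    expected: str      # reference output, or a substring the output must contain
    checker: Optional[Callable[[str], bool]] = None  # custom pass/fail check when string matching is too strict
```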

Step 2: Run Evaluation

Execute each test case and record:

  • Input provided
  • Expected output
  • Actual output
  • Pass/fail
  • Token usage
  • Time taken
  • Tool calls made
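
A sketch of the run loop, assuming the `TestCase` record above and a hypothetical `run_agent(input)` that returns the output plus token and tool-call counts; adapt the return shape to whatever your agent runtime actually exposes.

```python
import time

def run_case(run_agent, case):
    """Execute one test case and record the fields listed above."""
    start = time.time()
    output, tokens, tool_calls = run_agent(case.input)
    elapsed = time.time() - start

    passed = case.checker(output) if case.checker else (case.expected in output)
    return {
        "id": case.id,
        "category": case.category,
        "input": case.input,
        "expected": case.expected,
        "actual": output,
        "passed": passed,
        "tokens": tokens,
        "seconds": elapsed,
        "tool_calls": tool_calls,
    }
```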

Step 3: Analyze Results

Calculate metrics:

  • Overall pass rate
  • Per-category pass rates
  • Token efficiency
  • Error patterns
  • Improvement areas
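
Given the per-case records from Step 2, most of these metrics reduce to simple aggregation. A minimal sketch:

```python
from collections import defaultdict

def summarize(results):
    """Aggregate per-case records into overall and per-category metrics."""
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r)

    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "pass_rate_by_category": {
            cat: sum(r["passed"] for r in rs) / len(rs)
            for cat, rs in by_category.items()
        },
        "avg_tokens": sum(r["tokens"] for r in results) / len(results),
        "avg_seconds": sum(r["seconds"] for r in results) / len(results),
    }
```

Error patterns and improvement areas still need a human (or LLM) pass over the failing cases; they rarely fall out of the numbers alone.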

Step 4: Report Findings

## Eval Results

### Summary
- Test cases: N
- Pass rate: X%
- Average tokens: N
- Average time: Xs

### By Category
| Category | Pass Rate | Issues |
|----------|-----------|--------|
| Accuracy | X% | [list] |
| Efficiency | X% | [list] |
| Reliability | X% | [list] |

### Recommendations
1. [Specific improvement]
2. [Specific improvement]
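
The summary section of that report can be rendered directly from the `summarize()` output above; a sketch, with the category and recommendation sections left to fill in by hand:

```python
def render_summary(summary, n_cases):
    """Render the Summary section of the report template above."""
    return "\n".join([
        "## Eval Results",
        "",
        "### Summary",
        f"- Test cases: {n_cases}",
        f"- Pass rate: {summary['pass_rate']:.0%}",
        f"- Average tokens: {summary['avg_tokens']:.0f}",
        f"- Average time: {summary['avg_seconds']:.1f}s",
    ])
```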

Benchmark Scenarios

Code Generation

Input: "Write a function that sorts an array using quicksort"
Expected: Correct implementation with proper edge cases
Metrics: Compiles, passes tests, follows conventions
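
Expressed with the `TestCase` sketch from Step 1, this scenario might look like the following. The checker assumes the agent's output is a plain Python snippet defining `quicksort`; executing model output like this is only safe inside a sandbox.

```python
def quicksort_checker(output):
    """Pass if the output defines a quicksort function that sorts correctly."""
    namespace = {}
    try:
        exec(output, namespace)          # sandbox this in real use
        fn = namespace["quicksort"]
        return fn([3, 1, 2]) == [1, 2, 3] and fn([]) == []
    except Exception:
        return False

quicksort_case = TestCase(
    id="codegen-quicksort",
    category="happy_path",
    input="Write a function that sorts an array using quicksort",
    expected="quicksort",
    checker=quicksort_checker,
)
```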

Code Review

Input: [Code with intentional bugs]
Expected: Identifies all bugs with severity labels
Metrics: Recall (share of known bugs found), Precision (share of reported bugs that are real)
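
Scoring this scenario only needs a labeled bug list per test case. A sketch, assuming bugs are matched by identifier (real harnesses usually need fuzzier matching):

```python
def review_metrics(reported_bugs, known_bugs):
    """Recall: share of known bugs the agent reported.
    Precision: share of the agent's reports that are real bugs."""
    reported, known = set(reported_bugs), set(known_bugs)
    true_positives = reported & known
    recall = len(true_positives) / len(known) if known else 1.0
    precision = len(true_positives) / len(reported) if reported else 1.0
    return {"recall": recall, "precision": precision}
```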

Debugging

Input: [Error message + relevant code]
Expected: Correct root cause and fix
Metrics: Accuracy of diagnosis, correctness of fix

Anti-Rationalization Table

| Excuse | Counter |
|--------|---------|
| "I can tell it's working by looking" | Human judgment is biased and inconsistent. Metrics provide objective measurement. |
| "Evals take too much time" | Automated evals run in parallel. The time investment prevents production failures. |
| "One test case is enough" | Single tests don't reveal patterns. Multiple cases across categories provide confidence. |
| "The agent passed last time" | Regression happens. Regular evals catch quality drift before it reaches production. |