skill-forge-benchmark
Installation
SKILL.md
Skill Benchmarking & Performance Tracking
Measure and compare skill performance across iterations with statistical rigor using multiple trials, variance analysis, and trend tracking.
Process
Step 1: Define Benchmark Configuration
Accept configuration as:
- Existing eval set: Path to
evals/evals.json(from/skill-forge eval) - Benchmark config: Custom config with trial count and thresholds
Benchmark config schema:
{
"skill_name": "my-skill",
"skill_path": "./my-skill",
"eval_set_path": "./evals/evals.json",
"trials_per_eval": 3,
"baseline_type": "no_skill",
"previous_benchmark": null,
"thresholds": {
"min_pass_rate": 0.8,
"max_avg_tokens": 100000,
"max_avg_duration_seconds": 120,
"min_improvement_ratio": 1.0
}
}
Step 2: Execute Benchmark Runs
For each eval, run trials_per_eval times (default: 3) to get reliable metrics:
- Execute with-skill runs (3x per eval)
- Execute baseline runs (3x per eval)
- Capture per-run: pass/fail, token count, duration
- Save each run's
timing.jsonandgrading.json
Use agents/skill-forge-executor.md for parallel execution where possible.
Step 3: Aggregate Results
Run python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>:
Output benchmark.json schema:
{
"skill_name": "my-skill",
"iteration": 1,
"timestamp": "2026-03-06T12:00:00Z",
"summary": {
"total_evals": 10,
"with_skill": {
"pass_rate": 0.87,
"pass_rate_std": 0.05,
"avg_tokens": 45000,
"avg_duration_seconds": 34.2
},
"baseline": {
"pass_rate": 0.60,
"pass_rate_std": 0.08,
"avg_tokens": 62000,
"avg_duration_seconds": 52.1
},
"improvement_ratio": 1.45,
"token_savings_ratio": 0.73,
"time_savings_ratio": 0.66
},
"per_eval": [
{
"eval_id": 0,
"eval_name": "basic-trigger",
"with_skill": {"pass_rate": 1.0, "avg_tokens": 30000, "avg_duration_seconds": 20.1},
"baseline": {"pass_rate": 0.67, "avg_tokens": 50000, "avg_duration_seconds": 45.0},
"trials": 3
}
],
"thresholds_met": {
"min_pass_rate": true,
"max_avg_tokens": true,
"max_avg_duration_seconds": true,
"min_improvement_ratio": true
}
}
Step 4: Compare with Previous Iterations
If previous_benchmark is provided or prior iteration-<N-1> exists:
- Load previous
benchmark.json - Calculate delta per metric:
- Pass rate change
- Token usage change
- Duration change
- New regressions (evals that passed before but fail now)
- New improvements (evals that failed before but pass now)
Step 5: Generate Benchmark Report
# Benchmark Report: [skill-name]
## Iteration [N] vs [N-1]
### Summary
| Metric | Current | Previous | Delta | Threshold | Status |
|--------|---------|----------|-------|-----------|--------|
| Pass Rate | 87% | 78% | +9% | >= 80% | PASS |
| Avg Tokens | 45K | 52K | -13% | <= 100K | PASS |
| Avg Time | 34s | 41s | -17% | <= 120s | PASS |
| Improvement | 1.45x | 1.30x | +0.15x | >= 1.0x | PASS |
### Regressions (Action Required)
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-5 | PASS | FAIL | Output missing required section |
### Improvements
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-3 | FAIL | PASS | Error handling now works |
### Per-Eval Detail
[Full breakdown table]
### Variance Analysis
| Eval | Pass Rate | Std Dev | Trials | Reliability |
|------|-----------|---------|--------|-------------|
| eval-0 | 100% | 0.00 | 3 | High |
| eval-1 | 67% | 0.47 | 3 | Low (investigate) |
### Recommendations
[Based on regressions, low-reliability evals, and threshold failures]
Step 6: Threshold Gating
If any threshold fails:
- Flag as FAIL with specific threshold details
- List which evals caused the failure
- Recommend running
/skill-forge evolveto address issues - Do NOT approve for publish until thresholds pass
Error Handling
- Flaky trials: If a trial times out or crashes, exclude it from variance calculation and note
"trials_completed"vs"trials_requested"in per-eval results - Insufficient trials: If fewer than 2 trials complete for an eval, flag variance as
"unreliable"in the report - Missing baseline: If baseline runs fail entirely, report with-skill results only and skip improvement_ratio
- Threshold edge cases: If pass_rate equals the threshold exactly, treat as PASS
Integration with Other Sub-Skills
- skill-forge-eval: Provides the eval set and grading infrastructure
- skill-forge-evolve: Receives benchmark failures as improvement targets
- skill-forge-publish: Requires benchmark pass (score >= thresholds) before publish
- skill-forge-review: Can include benchmark summary in review report