# skill-forge-eval: Skill Evaluation Pipeline

Run structured evaluations against Claude Code skills to verify triggering, correctness, and quality using a multi-agent pipeline.

## Process

### Step 1: Define Eval Set
Accept eval definitions from any of the following sources:

- Path to an eval set JSON: `evals/evals.json` or a user-specified file
- Inline prompts: the user provides eval queries directly
- Auto-generated: generate from the skill description (see Step 1b)
Eval set JSON schema:

```json
{
  "skill_name": "my-skill",
  "skill_path": "./my-skill",
  "evals": [
    {
      "eval_id": 0,
      "eval_name": "descriptive-name",
      "prompt": "The user's task prompt",
      "input_files": [],
      "assertions": [
        {
          "name": "output-has-score",
          "check": "Output contains a numeric score between 0-100",
          "weight": 1.0
        }
      ],
      "should_trigger": true
    }
  ]
}
```
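The following sketch (a hypothetical helper, not one of the pipeline's scripts) loads an eval set file and checks the fields the later steps depend on:

```python
import json
from pathlib import Path

REQUIRED_EVAL_FIELDS = {"eval_id", "eval_name", "prompt", "assertions", "should_trigger"}

def load_eval_set(path: str) -> dict:
    """Load an eval set JSON file and verify the fields later steps depend on."""
    eval_set = json.loads(Path(path).read_text())
    for key in ("skill_name", "skill_path", "evals"):
        if key not in eval_set:
            raise ValueError(f"eval set missing required key: {key}")
    for entry in eval_set["evals"]:
        missing = REQUIRED_EVAL_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"eval {entry.get('eval_id')} missing fields: {sorted(missing)}")
    return eval_set

# Usage:
# eval_set = load_eval_set("evals/evals.json")
```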
### Step 1b: Auto-Generate Eval Set

If no eval set exists, generate one:

- Read the skill's SKILL.md description and instructions
- Run `python scripts/generate_eval_set.py <skill-path>` to produce a starter set (a rough sketch of this step follows the list)
- Present the generated set to the user for review and editing
- User approves or modifies before proceeding
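The generator script is not reproduced here; as a purely hypothetical sketch of the approach (naively pulling a description line from SKILL.md and emitting a one-eval starter set matching the schema above), it might look like:

```python
from pathlib import Path

def generate_starter_eval_set(skill_path: str) -> dict:
    """Hypothetical sketch only; the real scripts/generate_eval_set.py may differ."""
    skill_dir = Path(skill_path)
    skill_md = (skill_dir / "SKILL.md").read_text()
    # Naive description extraction: first non-empty line that is not a heading
    # or a frontmatter delimiter.
    description = next(
        (line.strip() for line in skill_md.splitlines()
         if line.strip() and not line.startswith(("#", "---"))),
        "Use this skill on a representative task.",
    )
    return {
        "skill_name": skill_dir.name,
        "skill_path": str(skill_dir),
        "evals": [{
            "eval_id": 0,
            "eval_name": "basic-usage",
            "prompt": f"Complete a representative task for this skill: {description}",
            "input_files": [],
            "assertions": [{
                "name": "produces-output",
                "check": "Run produces non-empty output relevant to the task",
                "weight": 1.0,
            }],
            "should_trigger": True,
        }],
    }
```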
### Step 2: Set Up Workspace

Create the eval workspace outside the skill directory to avoid confusing eval artifacts with skill files. Use a sibling directory or a dedicated location:

```
eval-workspace/
  iteration-1/
    eval-0/
      eval_metadata.json   # Assertions and config for this eval
      with_skill/
        outputs/           # Skill execution outputs
        timing.json        # Token count + duration
        grading.json       # Assertion results + evidence
      baseline/
        outputs/
        timing.json
        grading.json
    eval-1/
      eval_metadata.json
      with_skill/
        outputs/
        timing.json
        grading.json
      baseline/
        outputs/
        timing.json
        grading.json
    benchmark.json         # Aggregated metrics
    benchmark.md           # Human-readable report
```

For each eval directory, create `eval_metadata.json` from the eval set entry:
```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name",
  "prompt": "The user's task prompt",
  "assertions": [...],
  "should_trigger": true
}
```
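A sketch of the workspace setup (hypothetical helper names; the directory layout follows the tree above):

```python
import json
from pathlib import Path

def setup_workspace(workspace: str, iteration: int, eval_set: dict) -> Path:
    """Create the iteration directory tree and write eval_metadata.json for each eval."""
    iteration_dir = Path(workspace) / f"iteration-{iteration}"
    for entry in eval_set["evals"]:
        eval_dir = iteration_dir / f"eval-{entry['eval_id']}"
        for run_type in ("with_skill", "baseline"):
            (eval_dir / run_type / "outputs").mkdir(parents=True, exist_ok=True)
        metadata = {
            "eval_id": entry["eval_id"],
            "eval_name": entry["eval_name"],
            "prompt": entry["prompt"],
            "assertions": entry["assertions"],
            "should_trigger": entry["should_trigger"],
        }
        (eval_dir / "eval_metadata.json").write_text(json.dumps(metadata, indent=2))
    return iteration_dir
```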
### Step 3: Execute Eval Runs

For each eval in the set, spawn two parallel runs.

**With-skill run** (delegate to `agents/skill-forge-executor.md`):

```
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the assertions check>
```

**Baseline run** (delegate to `agents/skill-forge-executor.md`):

- For new skills: run without the skill loaded
- For improved skills: run with the previous version (snapshot it first)

Save timing data to `timing.json` in each run directory:
```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```
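A sketch of recording timing for a run, assuming the token count and wall-clock start time are captured around the executor call (field names match the example above):

```python
import json
import time
from pathlib import Path

def record_timing(run_dir: str, total_tokens: int, started_at: float) -> dict:
    """Write timing.json with the fields used by the aggregation step."""
    duration_ms = int((time.monotonic() - started_at) * 1000)
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    Path(run_dir, "timing.json").write_text(json.dumps(timing, indent=2))
    return timing
```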
### Step 4: Grade Results

Delegate to `agents/skill-forge-grader.md` for each completed run:

- Grade against the assertions defined in `eval_metadata.json`
- Save results to `grading.json` per run:
```json
{
  "eval_id": 0,
  "run_type": "with_skill",
  "assertions": [
    {
      "name": "output-has-score",
      "passed": true,
      "evidence": "Found score: 87/100 on line 14"
    }
  ],
  "pass_rate": 1.0
}
```
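The grading format above does not state how `pass_rate` is derived; a reasonable assumption, sketched below, is the weighted fraction of passed assertions, with weights taken from `eval_metadata.json` and ungradable assertions (`"passed": null`) counted as not passed:

```python
def compute_pass_rate(metadata_assertions: list[dict], graded_assertions: list[dict]) -> float:
    """Weighted fraction of graded assertions that passed (assumption, not a spec)."""
    weights = {a["name"]: a.get("weight", 1.0) for a in metadata_assertions}
    total = sum(weights.get(a["name"], 1.0) for a in graded_assertions)
    passed = sum(
        weights.get(a["name"], 1.0)
        for a in graded_assertions
        if a.get("passed") is True
    )
    return passed / total if total else 0.0
```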
### Step 5: Aggregate and Analyze

- Run `python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>`
- This produces `benchmark.json` and `benchmark.md` with:
  - Pass rate per eval (with_skill vs baseline)
  - Average time and token usage
  - Improvement ratio (with_skill / baseline); a rough aggregation sketch follows this list
- Delegate to `agents/skill-forge-analyzer.md` to:
  - Surface patterns that aggregate stats might hide
  - Identify consistently failing assertion types
  - Flag regressions from previous iterations
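As a rough illustration of the aggregation (not the actual `scripts/aggregate_benchmark.py`, whose internals are not shown here), the averages and improvement ratio can be derived from the per-run `grading.json` and `timing.json` files:

```python
import json
from pathlib import Path

def aggregate_iteration(iteration_dir: str) -> dict:
    """Collect pass rates and timing per run type, then compute the improvement ratio."""
    stats = {"with_skill": [], "baseline": []}
    for eval_dir in sorted(Path(iteration_dir).glob("eval-*")):
        for run_type in stats:
            run_dir = eval_dir / run_type
            grading = json.loads((run_dir / "grading.json").read_text())
            timing = json.loads((run_dir / "timing.json").read_text())
            stats[run_type].append({
                "pass_rate": grading["pass_rate"],
                "tokens": timing["total_tokens"],
                "seconds": timing["total_duration_seconds"],
            })

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    summary = {
        run_type: {
            "avg_pass_rate": mean([r["pass_rate"] for r in runs]),
            "avg_tokens": mean([r["tokens"] for r in runs]),
            "avg_seconds": mean([r["seconds"] for r in runs]),
        }
        for run_type, runs in stats.items()
    }
    baseline_rate = summary["baseline"]["avg_pass_rate"]
    summary["improvement_ratio"] = (
        summary["with_skill"]["avg_pass_rate"] / baseline_rate if baseline_rate else None
    )
    return summary
```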
### Step 6: Present Results

Generate a summary report:

```markdown
# Eval Report: [skill-name] — Iteration [N]

## Overall

| Metric     | With Skill | Baseline | Delta |
|------------|------------|----------|-------|
| Pass Rate  | X%         | Y%       | +Z%   |
| Avg Time   | Xs         | Ys       | -Zs   |
| Avg Tokens | X          | Y        | -Z    |

## Per-Eval Results

| Eval   | With Skill | Baseline | Status     |
|--------|------------|----------|------------|
| eval-0 | PASS       | FAIL     | Improved   |
| eval-1 | PASS       | PASS     | Maintained |

## Patterns & Insights
[From analyzer agent]

## Recommendations
[Specific improvements based on failures]
```
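A small sketch of rendering the Overall table from an aggregated summary; the field names follow the aggregation sketch above and are assumptions about the exact layout of `benchmark.json`:

```python
def render_overall_table(summary: dict) -> str:
    """Render the Overall metrics table from an aggregated summary dict."""
    ws, bl = summary["with_skill"], summary["baseline"]
    rows = [
        ("Pass Rate", f"{ws['avg_pass_rate']:.0%}", f"{bl['avg_pass_rate']:.0%}",
         f"{ws['avg_pass_rate'] - bl['avg_pass_rate']:+.0%}"),
        ("Avg Time", f"{ws['avg_seconds']:.1f}s", f"{bl['avg_seconds']:.1f}s",
         f"{ws['avg_seconds'] - bl['avg_seconds']:+.1f}s"),
        ("Avg Tokens", f"{ws['avg_tokens']:.0f}", f"{bl['avg_tokens']:.0f}",
         f"{ws['avg_tokens'] - bl['avg_tokens']:+.0f}"),
    ]
    lines = ["| Metric | With Skill | Baseline | Delta |",
             "|--------|-----------|----------|-------|"]
    lines += [f"| {m} | {w} | {b} | {d} |" for m, w, b, d in rows]
    return "\n".join(lines)
```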
### Step 7: Collect Feedback

Save user feedback to `feedback.json`:
```json
{
  "reviews": [
    {
      "run_id": "eval-0-with_skill",
      "feedback": "the chart is missing axis labels",
      "timestamp": "2026-03-06T12:00:00Z"
    }
  ],
  "status": "complete"
}
```
Pass feedback to `/skill-forge evolve` for the next iteration.
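A sketch of appending a review to `feedback.json` (the helper name and UTC timestamp formatting are assumptions; the file layout matches the example above):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def add_feedback(workspace: str, run_id: str, feedback: str, complete: bool = False) -> None:
    """Append a user review to feedback.json, creating the file if needed."""
    path = Path(workspace) / "feedback.json"
    data = json.loads(path.read_text()) if path.exists() else {"reviews": [], "status": "in_progress"}
    data["reviews"].append({
        "run_id": run_id,
        "feedback": feedback,
        "timestamp": datetime.now(timezone.utc)
                             .isoformat(timespec="seconds")
                             .replace("+00:00", "Z"),
    })
    if complete:
        data["status"] = "complete"
    path.write_text(json.dumps(data, indent=2))
```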
## Advanced: Blind Comparison

For rigorous A/B testing between skill versions:

- Delegate to `agents/skill-forge-comparator.md`
- Pass two directories: `eval-<ID>/with_skill/outputs/` and `eval-<ID>/baseline/outputs/`
- Comparator assigns random labels (Version A / Version B) so it cannot know which is new; see the label-shuffling sketch after this list
- Rates each output on assertion criteria from `eval_metadata.json`
- Returns preference scores without knowing which is "new" vs "old"
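A minimal sketch of the label shuffling (hypothetical helper; the key is kept aside so the labels can be decoded after the comparator returns its preference):

```python
import random

def blind_labels(with_skill_dir: str, baseline_dir: str) -> tuple[dict, dict]:
    """Randomly assign Version A / Version B labels to the two output directories.

    Returns (labels, key): labels maps each label to a directory for the comparator,
    and key records which label hides which run type for later decoding.
    """
    pairs = [("with_skill", with_skill_dir), ("baseline", baseline_dir)]
    random.shuffle(pairs)
    labels = {"Version A": pairs[0][1], "Version B": pairs[1][1]}
    key = {"Version A": pairs[0][0], "Version B": pairs[1][0]}
    return labels, key
```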
## Error Handling

- Executor timeout: if a run exceeds 5 minutes, terminate it and mark `"timed_out": true` in `timing.json` (see the timeout sketch after this list)
- Executor failure: if a run crashes, save the error to `error.txt` in the run directory and continue with the remaining evals
- Grading failure: if grading cannot determine pass/fail, mark the assertion as `"passed": null` with evidence explaining why
- Missing files: if `timing.json` or `grading.json` is missing after a run, flag the eval as incomplete in the report
- Partial completion: always aggregate and report whatever results are available — do not block on one failed eval
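A sketch of the timeout and crash handling, assuming the executor is invoked as a subprocess (the actual agent delegation mechanism is not shown here):

```python
import json
import subprocess
from pathlib import Path

EXECUTOR_TIMEOUT_SECONDS = 5 * 60  # 5-minute limit per run

def run_executor(cmd: list[str], run_dir: str) -> bool:
    """Run one eval execution, enforcing the timeout and recording failures."""
    run_path = Path(run_dir)
    try:
        subprocess.run(cmd, check=True, timeout=EXECUTOR_TIMEOUT_SECONDS,
                       capture_output=True, text=True)
        return True
    except subprocess.TimeoutExpired:
        timing_path = run_path / "timing.json"
        timing = json.loads(timing_path.read_text()) if timing_path.exists() else {}
        timing["timed_out"] = True
        timing_path.write_text(json.dumps(timing, indent=2))
    except subprocess.CalledProcessError as err:
        # Save the crash details and let the pipeline continue with remaining evals.
        (run_path / "error.txt").write_text(err.stderr or str(err))
    return False
```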
## Quality Gates
Before marking an eval run as complete:
- All evals executed (with_skill + baseline)
- Timing data captured for every run
- All assertions graded with evidence
- Benchmark aggregated with pass rate, time, tokens
- Analyzer patterns documented
- Results presented to user