# Skill Eval

Eval-driven development for skills: draft, scaffold evals, test, grade, review, improve, repeat.
Your job is to figure out where the user is in this process and jump in. Maybe they want to test an existing skill. Maybe they want to optimize a description. Maybe they want the full ship pipeline. Be flexible — if the user doesn't want evals, skip them.
## Environment Detection

Before starting, detect the environment:

- Claude Code — check `which claude`. Full capabilities: subagents, browser, `claude -p` for trigger testing.
- Cowork — check for Task tools. Subagents work, but no display. Use `--static <path>` for HTML viewer output.
- Claude.ai — fallback. Inline testing only. Skip benchmarking and description optimization (both require the `claude` CLI).

Adapt the workflows below based on the detected environment.
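The detection order above can be sketched as follows. This is an illustrative sketch, not part of the bundled scripts; `has_task_tool` is a hypothetical flag standing in for probing for Task tools, and the `which` parameter is injectable so the logic is testable:

```python
import shutil
from typing import Callable, Optional


def detect_environment(which: Callable[[str], Optional[str]] = shutil.which,
                       has_task_tool: bool = False) -> str:
    """Pick the richest environment available, in capability order."""
    if which("claude"):          # Claude Code: full CLI, subagents, browser
        return "claude-code"
    if has_task_tool:            # Cowork: subagents work, but no display
        return "cowork"
    return "claude-ai"           # fallback: inline testing only
```

Falling through in this order means features degrade gracefully rather than failing outright when a capability is missing.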
## Quick Start

Three steps to evaluate any skill:

1. Scaffold evals from the skill's trigger queries:

   ```
   python3 scripts/eval_scaffold.py --skill-path <skill-dir>
   ```

2. Initialize and run the workspace:

   ```
   python3 scripts/eval_workspace.py init <skill-dir>
   python3 scripts/eval_workspace.py run <skill-dir>
   ```

3. Review results in the interactive viewer:

   ```
   python3 scripts/generate_review.py <skill-dir>/.skill-eval --skill-name <name>
   ```
## Eval Workspace

The `.skill-eval/` directory lives inside the skill. It contains sequentially numbered runs, eval definitions, and a manifest tracking history.

```
.skill-eval/
├── manifest.json             # Tracks all runs, pinned baseline
├── evals/                    # .eval.yaml test case files
│   ├── trigger-query-1.eval.yaml
│   └── negative-test.eval.yaml
└── runs/
    ├── 001/                  # Sequential run directories
    │   ├── outputs/          # Files created during eval
    │   ├── transcript.md     # Execution transcript
    │   ├── grading.json      # Tier 1 + Tier 2 grades
    │   ├── timing.json       # Tokens, duration (capture immediately!)
    │   ├── metrics.json      # Tool calls, errors, file counts
    │   └── skill-snapshot.md # SKILL.md at time of run
    └── 002/
```
Commands:

- `eval_workspace.py init <skill>` — Create workspace
- `eval_workspace.py run <skill>` — Create next run dir with snapshot
- `eval_workspace.py compare <skill> [--run-a N] [--run-b M]` — Compare runs
- `eval_workspace.py clean <skill> --keep-last N` — Prune old runs
- `eval_workspace.py pin <skill> [run_id]` — Pin regression baseline
- `eval_workspace.py regress <skill>` — Compare against pinned baseline

For benchmark mode, use the nested layout: `runs/NNN/eval-E/{with_skill,without_skill}/run-R/`
## Defining Test Cases

Test cases are `.eval.yaml` files in `.skill-eval/evals/`. Each has two tiers:

Tier 1 — Checks (programmatic, zero tokens):

```yaml
checks:
  - type: file_exists
    target: "output.pdf"
  - type: contains
    target: "output.txt"
    expected: "Summary"
```

Check types: `file_exists`, `regex`, `json_valid`, `yaml_valid`, `exit_code`, `contains`, `not_contains`, `line_count_range`, `file_size_range`.
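A minimal sketch of how a few of these check types might be evaluated against a run's outputs directory — the actual `eval_grader.py` implementation may differ, and only a subset of the listed types is shown:

```python
import json
import re
from pathlib import Path


def run_check(check: dict, outputs_dir: Path) -> bool:
    """Evaluate one Tier 1 check dict (as parsed from .eval.yaml)."""
    kind = check["type"]
    target = outputs_dir / check.get("target", "")
    if kind == "file_exists":
        return target.exists()
    if kind == "contains":
        return check["expected"] in target.read_text()
    if kind == "not_contains":
        return check["expected"] not in target.read_text()
    if kind == "regex":
        return re.search(check["expected"], target.read_text()) is not None
    if kind == "json_valid":
        try:
            json.loads(target.read_text())
            return True
        except json.JSONDecodeError:
            return False
    raise ValueError(f"unsupported check type: {kind}")
```

Each check is a pure function of the outputs directory, which is what makes Tier 1 free to run on every iteration.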
Tier 2 — Assertions (agent-graded):

```yaml
assertions:
  - "The report includes a revenue trend chart"
  - "The summary table has correct quarterly totals"
```
Assertions run ONLY if all Tier 1 checks pass. This saves 60-80% of grading tokens by catching obvious failures early.
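The gating can be sketched as follows — `grade_assertions` is a hypothetical stand-in for spawning the grader agent, not part of the real scripts:

```python
from typing import Callable, Sequence


def two_tier_grade(check_results: Sequence[bool],
                   grade_assertions: Callable[[], dict]) -> dict:
    """Run the Tier 2 grader only when every Tier 1 check passed.

    Failing cheap programmatic checks first avoids spending grader
    tokens on runs that are already known to be broken.
    """
    if not all(check_results):
        return {"tier1_pass": False, "tier2": None, "graded": False}
    return {"tier1_pass": True, "tier2": grade_assertions(), "graded": True}
```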
See `references/eval-format.md` for the complete format spec with category-specific examples.
## Two-Tier Grading

Run the grader on eval results:

```
python3 scripts/eval_grader.py \
  --eval-file <.eval.yaml> \
  --outputs-dir <outputs/> \
  --transcript <transcript.md> \
  --metrics <metrics.json> \
  --timing <timing.json>
```
Tier 1: Runs all programmatic checks instantly. If any fail, Tier 2 is skipped entirely.

Tier 2: Spawns a grader agent using `references/agents/grader.md` as its system prompt. The grader:

- Evaluates each assertion with evidence (PASS/FAIL)
- Extracts and verifies implicit claims from the output
- Reads the executor's `user_notes` from metrics.json
- Critiques eval quality (flags non-discriminating assertions)
- Outputs grading.json with all fields per `references/schemas.md`
## Running Evaluations

### Trigger Testing

Test whether the skill activates for realistic queries:

```
python3 scripts/run_eval.py \
  --eval-set <queries.json> \
  --skill-path <skill-dir> \
  --runs-per-query 3 \
  --timeout 30
```
### Full Evaluation (with subagents)

For each test case, spawn two subagents in the same turn:

- With-skill run: Execute the eval prompt with the skill loaded, save outputs
- Baseline run: Same prompt, no skill (or the old skill version)

Critical: When subagent tasks complete, the notification includes `total_tokens` and `duration_ms`. Save these to timing.json IMMEDIATELY — this data is NOT persisted anywhere else.

Capture execution metrics (`tool_calls`, `errors`, `user_notes`) from the executor into metrics.json.
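A minimal sketch of the immediate capture step, assuming timing.json holds exactly the two fields named above (the real notification payload may carry more):

```python
import json
from pathlib import Path


def save_timing(run_dir: str, total_tokens: int, duration_ms: int) -> Path:
    """Persist subagent timing the moment the completion notification
    arrives; these fields are not stored anywhere else."""
    path = Path(run_dir) / "timing.json"
    path.write_text(json.dumps(
        {"total_tokens": total_tokens, "duration_ms": duration_ms}, indent=2))
    return path
```

Writing the file synchronously, before any further processing, is the point: the values are gone once the notification scrolls out of context.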
## Reviewing Results

### Terminal Diff

Quick comparison between runs:

```
python3 scripts/eval_diff.py --workspace <.skill-eval/>
```

Shows: SKILL.md section changes, pass/fail deltas, regressions (PASS→FAIL flagged prominently), and timing changes.
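The PASS→FAIL flagging can be sketched like this — the verdict maps are an assumed shape (check or assertion name mapped to "PASS"/"FAIL"), not the literal grading.json schema:

```python
def find_regressions(before: dict, after: dict) -> list:
    """Names of expectations that were PASS in the earlier run
    but are FAIL in the later one."""
    return sorted(name for name, verdict in before.items()
                  if verdict == "PASS" and after.get(name) == "FAIL")
```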
### Interactive HTML Viewer

Deep review with Outputs + Benchmark tabs:

```
python3 scripts/generate_review.py <workspace> \
  --skill-name <name> \
  --benchmark <benchmark.json>
```

For headless use or Cowork, add `--static <output.html>`. For iteration comparison, add `--previous-workspace <prev>`.
Wait for user feedback before improving the skill.
## Token Budget Analysis

Understand the skill's context cost:

```
python3 scripts/token_budget.py --skill-path <skill-dir>
```

Reports three levels:

- Level 1 (metadata): Name + description tokens (always loaded)
- Level 2 (body): Per-section token breakdown (loaded on trigger)
- Level 3 (references): Per-file token cost (loaded on demand)

Flags heavy sections and recommends extracting them to references.
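The Level 2 breakdown can be approximated as below. This sketch uses the common ~4-characters-per-token heuristic rather than a real tokenizer, so treat the numbers as rough estimates:

```python
def section_token_estimates(skill_md: str,
                            chars_per_token: float = 4.0) -> dict:
    """Rough per-section token estimate for a SKILL.md body."""
    sections: dict = {}
    current = "(preamble)"
    for line in skill_md.splitlines():
        if line.startswith("#"):          # any heading starts a new section
            current = line.lstrip("#").strip()
            sections.setdefault(current, 0)
            continue
        sections[current] = sections.get(current, 0) + len(line)
    return {name: round(chars / chars_per_token)
            for name, chars in sections.items()}
```

Sections with outsized estimates are the natural candidates for extraction to references.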
## Improving the Skill

After reviewing results and user feedback:

- Generalize from feedback — changes should help the whole category, not just one test case. Avoid overfit constraints.
- Explain the why — reframe MUSTs as reasoning. Theory of mind beats rigid instructions.
- Keep prompts lean — remove content that isn't pulling its weight. Read transcripts to spot wasted work.
- Extract repeated work — if all test runs wrote similar scripts, bundle them:

  ```
  python3 scripts/extract_scripts.py \
    --transcripts ".skill-eval/runs/*/transcript.md" \
    --output-dir candidates/
  ```

Review eval feedback in grading.json (`eval_feedback.suggestions`) before improving — fixing the measurement takes priority over tuning the skill.
## Regression Testing

Pin a known-good baseline, then compare future runs against it:

```
# Pin the current run
python3 scripts/eval_workspace.py pin <skill-dir>

# After making changes, check for regressions
python3 scripts/eval_workspace.py regress <skill-dir>
```

The regression check compares pass/fail per expectation and flags any PASS→FAIL transitions prominently.
## Description Optimization

Optimize the skill's description for better triggering accuracy:

**Step 1: Generate trigger queries**

Create 20 realistic queries — a mix of should-trigger (8-10) and should-not-trigger (8-10). Save as JSON:

```json
[
  {"query": "realistic user prompt with detail", "should_trigger": true},
  {"query": "near-miss that shares keywords", "should_trigger": false}
]
```

Queries should be substantive (not "read this file"), include detail (file paths, context), and test the boundary (near-misses for the negatives).
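Scoring observed behaviour against this labeled set can be sketched like so — `triggered` is a hypothetical record of whether the skill actually activated for each query (e.g. aggregated over the `--runs-per-query` attempts):

```python
def trigger_accuracy(eval_set: list, triggered: dict) -> dict:
    """Score trigger behaviour against labeled queries."""
    tp = sum(1 for q in eval_set
             if q["should_trigger"] and triggered[q["query"]])
    tn = sum(1 for q in eval_set
             if not q["should_trigger"] and not triggered[q["query"]])
    return {"accuracy": (tp + tn) / len(eval_set),
            "true_positives": tp,
            "true_negatives": tn}
```

Tracking true negatives separately matters here: a description that always triggers scores perfectly on positives while failing every near-miss.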
**Step 2: Review with user**

Present the queries using the HTML editor:

1. Read `assets/eval_review.html`
2. Replace `__EVAL_DATA_PLACEHOLDER__` with the JSON array
3. Replace `__SKILL_NAME_PLACEHOLDER__` and `__SKILL_DESCRIPTION_PLACEHOLDER__`
4. Write to a temp file and open it in the browser
5. The user edits and exports `eval_set.json`
**Step 3: Run the optimization loop**

```
python3 scripts/run_loop.py \
  --eval-set <eval_set.json> \
  --skill-path <skill-dir> \
  --model <model-id> \
  --max-iterations 5
```

This handles: a stratified 60/40 train/test split, parallel evaluation, extended thinking for description improvement, a live HTML report, and selection of the best description by TEST score (which prevents overfitting).
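The stratified split can be sketched as follows — `train_frac` and the seed handling are illustrative, not the actual `run_loop.py` behavior. Splitting within each label group keeps the positive/negative balance similar in both halves:

```python
import random


def stratified_split(queries: list, train_frac: float = 0.6, seed: int = 0):
    """Split labeled queries ~60/40, preserving label balance."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        group = [q for q in queries if q["should_trigger"] is label]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train += group[:cut]
        test += group[cut:]
    return train, test
```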
**Step 4: Apply the result**

Take `best_description` from the output and update the SKILL.md frontmatter.
## Advanced: Blind Comparison

For rigorous A/B comparison between skill versions:

1. Run the two versions in parallel — each gets the same eval prompts
2. Randomly assign them as A/B — prevents order bias
3. Spawn a comparator agent with `references/agents/comparator.md` — generates a task-specific rubric (content: correctness/completeness/accuracy; structure: organization/formatting/usability; 1-5 scale), scores both, and picks a winner → comparison.json
4. Spawn an analyzer agent with `references/agents/analyzer.md` — reads both skills + transcripts + the comparison, identifies winner strengths, loser weaknesses, and prioritized improvements → analysis.json
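The random assignment in step 2 can be sketched as below; the label mapping would be kept out of the comparator's context so the grading stays blind:

```python
import random


def blind_assign(run_x: str, run_y: str, seed=None) -> dict:
    """Randomly label two runs A/B so the comparator cannot infer
    which version is which from ordering."""
    rng = random.Random(seed)
    pair = [run_x, run_y]
    rng.shuffle(pair)
    return {"A": pair[0], "B": pair[1]}
```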
This is optional and requires subagents. The human review loop is usually sufficient.
## Ship Pipeline

One-command quality gate + packaging:

1. Validate — `skills-ref validate` (or `quick_validate.py`)
2. Eval — Run the full eval suite. All Tier 1 checks must pass, plus 80%+ of Tier 2 assertions
3. Quality gate:
   - SKILL.md under 500 lines
   - No TODO markers in content
   - All referenced files exist
   - Token budget under threshold
4. Package — Generate the `.skill` file via `package_skill.py`
5. Summary card — Report: skill name, version, pass rate, token budget, package size
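Two of the quality-gate checks can be sketched like this (the real pipeline presumably covers all four; the function returns a list of failures, empty meaning pass):

```python
from pathlib import Path


def quality_gate(skill_dir: str, max_lines: int = 500) -> list:
    """Check the SKILL.md line limit and TODO markers for a skill dir."""
    failures = []
    text = (Path(skill_dir) / "SKILL.md").read_text()
    if len(text.splitlines()) > max_lines:
        failures.append(f"SKILL.md exceeds {max_lines} lines")
    if "TODO" in text:
        failures.append("TODO markers present")
    return failures
```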
## Reference Materials

Consult these based on your needs:

- `references/schemas.md` — All JSON schemas (eval, grading, benchmark, comparison, analysis, manifest)
- `references/eval-methodology.md` — Philosophy, two-tier rationale, iteration guidance, multi-environment support, subagent patterns, extended thinking
- `references/eval-format.md` — `.eval.yaml` specification, check-type reference, category-specific recommendations
- `references/agents/grader.md` — Grader agent: 8-step process for assertions, claims, and eval critique
- `references/agents/comparator.md` — Blind A/B comparison with task-specific rubrics
- `references/agents/analyzer.md` — Post-hoc analysis + benchmark pattern surfacing