skill-ab-test
Skill A/B Test
Compare a modified skill against its baseline to measure whether changes actually improve agent behavior. Runs controlled tests with both versions, grades outputs against assertions, and produces a benchmark report.
When to Use
- After modifying a SKILL.md, references, or scripts
- Before committing skill changes to verify improvement
- When unsure if added instructions actually change behavior
- To identify regressions from refactoring
Prerequisites
- The skill must be in a git repo (baseline comes from the last commit)
- Node.js available (for the report generator)
- Changes should already be applied to the skill files (working tree = new version, git HEAD = baseline)
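A quick preflight, if you want to confirm both prerequisites before starting (`<skill-path>` is a placeholder for the skill directory):

```bash
git log -1 --oneline -- <skill-path>   # confirms a committed baseline exists to compare against
node --version                         # the report generator in Step 7 runs on Node.js
```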
Step 1: Identify Skill and Changes
- Ask the user which skill to test (or detect from a recent `git diff`)
- Determine the skill path (the directory containing SKILL.md)
- Run `git diff HEAD -- <skill-path>` to see what changed
- Summarize the changes for the user in a table:
| File | Change type | Summary |
|---|---|---|
| SKILL.md | Modified | Added section budget validation step |
| references/quality-gate.md | Modified | Added exclusions list |
| scripts/audit.js | Modified | New analyzeSectionBudgets function |
If there are no uncommitted changes, ask the user what they plan to change or if they want to compare two branches.
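If helpful, the raw material for that table can come straight from git. A minimal sketch (the branch names in the last command are placeholders):

```bash
# Changed files with their change type (M = modified, A = added, D = deleted)
git diff --name-status HEAD -- <skill-path>

# Rough size of each change, useful for the Summary column
git diff --stat HEAD -- <skill-path>

# Comparing two branches instead of the working tree vs HEAD (placeholder branch names)
git diff --name-status main..skill-update -- <skill-path>
```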
Step 2: Create Workspace and Snapshot Baseline
SKILL_PATH="<skill-path>"
SKILL_NAME="$(basename "$SKILL_PATH")"
WORKSPACE="$(pwd)/${SKILL_NAME}-ab-workspace"
mkdir -p "$WORKSPACE/baseline" "$WORKSPACE/evals"
# Snapshot baseline from git HEAD
git show HEAD:"$SKILL_PATH" > /dev/null 2>&1 || echo "ERROR: skill not committed yet"
cd "$(git rev-parse --show-toplevel)"
git archive HEAD -- "$SKILL_PATH" | tar -x -C "$WORKSPACE/baseline/"
Verify baseline snapshot:
diff -rq "$WORKSPACE/baseline/$SKILL_PATH" "$SKILL_PATH" || true
This lists the files that differ between the baseline snapshot and the current working tree (including files present in only one of them).
Step 3: Design Eval Scenarios
Generate 3-5 eval scenarios that specifically target the changes. Each eval should:
- Test behavior the changes are supposed to improve
- Have 3-5 concrete, verifiable assertions
- Be realistic (based on actual use cases, not synthetic)
Auto-Generating from Diff
For each change in the diff, ask: "What would an agent do differently with vs without this change?" That's your eval.
Example: if you added a section budget validation step, write an eval where the input has sections exceeding the budget and assert the agent flags them.
Eval Format
Save to {workspace}/evals/evals.json:
{
"skill_name": "<skill-name>",
"evals": [
{
"id": 1,
"name": "descriptive-kebab-name",
"prompt": "Realistic task prompt that exercises the changed behavior",
"expected_output": "Human-readable description of what success looks like",
"assertions": [
{
"text": "Verifiable statement about the output",
"type": "output_contains_concept"
}
]
}
]
}
Present the evals to the user before running. Ask: "These are the test cases. Want to adjust any before I run them?"
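Before running, a quick structural check can catch evals that fall outside the 3-5 assertion guideline. A sketch, assuming `jq` is available:

```bash
# Print the names of evals with fewer than 3 or more than 5 assertions
jq -r '.evals[]
       | select((.assertions | length) < 3 or (.assertions | length) > 5)
       | .name' "{workspace}/evals/evals.json"
```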
Step 4: Run A/B Tests
For each eval, spawn two agents in parallel:
Variant A: With New Skill (current working tree)
You are being evaluated on a skill. Read the skill at {skill-path}/SKILL.md and follow it.
Your task: {eval prompt}
Save all outputs to {workspace}/iteration-{N}/{eval-name}/new_skill/outputs/
Save your full response to {workspace}/iteration-{N}/{eval-name}/new_skill/outputs/response.md
Variant B: With Baseline Skill (from snapshot)
You are being evaluated on a skill. Read the skill at {workspace}/baseline/{skill-path}/SKILL.md and follow it.
Your task: {eval prompt}
Save all outputs to {workspace}/iteration-{N}/{eval-name}/old_skill/outputs/
Save your full response to {workspace}/iteration-{N}/{eval-name}/old_skill/outputs/response.md
Execution Strategy
- Create eval_metadata.json for each eval:
{
"eval_id": 1,
"eval_name": "descriptive-name",
"prompt": "the eval prompt",
"assertions": [...]
}
- Launch ALL agents in a single turn for maximum parallelism
- Use background agents — launch all, then wait for notifications
- Each eval gets 2 agents (new_skill + old_skill) running simultaneously
Important: Both variants run in the same repo. The only difference is which SKILL.md they read. No files need to be hidden or moved.
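A minimal sketch for scaffolding one eval's run directories and its eval_metadata.json, assuming `jq` and the `$WORKSPACE` variable from Step 2 (the eval name is a placeholder):

```bash
EVAL_NAME="descriptive-kebab-name"
RUN_DIR="$WORKSPACE/iteration-1/$EVAL_NAME"
mkdir -p "$RUN_DIR/new_skill/outputs" "$RUN_DIR/old_skill/outputs"

# Pull the matching eval out of evals.json into eval_metadata.json
jq --arg name "$EVAL_NAME" \
   '.evals[] | select(.name == $name)
    | {eval_id: .id, eval_name: .name, prompt: .prompt, assertions: .assertions}' \
   "$WORKSPACE/evals/evals.json" > "$RUN_DIR/eval_metadata.json"
```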
Step 5: Grade Outputs
After all agents complete, grade each run.
Grading Process
For each run (new_skill and old_skill for each eval):
- Read the response at `outputs/response.md`
- Read any other output files in `outputs/`
- Evaluate each assertion from `eval_metadata.json` against the outputs
- Save grading to `grading.json` (sibling to `outputs/`):
{
"eval_id": 1,
"eval_name": "descriptive-name",
"config": "new_skill",
"expectations": [
{
"text": "Assertion text from eval",
"passed": true,
"evidence": "Specific quote or description from the output"
}
],
"summary": {
"passed": 3,
"failed": 1,
"total": 4,
"pass_rate": 0.75
}
}
Grading with Subagent
Spawn a grading subagent using the prompt from agents/grader.md:
Grade the following eval runs. For each run directory, read the outputs and evaluate assertions.
Runs to grade:
{list of run directories with their assertions}
For each run, save grading.json as specified in agents/grader.md.
Then write a summary comparing new_skill vs old_skill across all evals.
Automated Assertion Checks
Where possible, use scripted checks before the grading subagent:
# Check if response mentions a concept
grep -ci "section budget" outputs/response.md
# Check if a file was created
test -f outputs/audit-report.json && echo "PASS" || echo "FAIL"
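The summary block of each grading.json can also be cross-checked against the expectations it summarizes. A sketch, assuming `jq`:

```bash
# Recompute the pass/fail counts from the expectations array
jq '{passed: ([.expectations[] | select(.passed)] | length),
    failed: ([.expectations[] | select(.passed | not)] | length),
    total:  (.expectations | length)}' grading.json
```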
Step 6: Generate Benchmark
Build {workspace}/iteration-{N}/benchmark.json:
{
"metadata": {
"skill_name": "<skill-name>",
"skill_path": "<skill-path>",
"timestamp": "<ISO timestamp>",
"evals_run": ["eval-name-1", "eval-name-2"],
"model": "<model used>",
"changes_summary": "Brief description of what changed"
},
"runs": [
{
"eval_name": "descriptive-name",
"new_skill": { "pass_rate": 1.0, "assertions_passed": 4, "assertions_total": 4 },
"old_skill": { "pass_rate": 0.75, "assertions_passed": 3, "assertions_total": 4 }
}
],
"summary": {
"new_skill": { "mean_pass_rate": 0.92, "total_passed": 11, "total_assertions": 12 },
"old_skill": { "mean_pass_rate": 0.75, "total_passed": 9, "total_assertions": 12 },
"delta": { "pass_rate": "+0.17", "assertions": "+2" },
"verdict": "improvement"
},
"notes": [
"All old_skill failures trace to missing section budget validation",
"No regressions detected"
]
}
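The per-eval numbers in the runs array can be read straight out of the grading files rather than retyped. A sketch for iteration 1, assuming `jq` and the workspace layout shown at the end of this document:

```bash
# Print each eval's pass rates for both variants
for dir in "$WORKSPACE"/iteration-1/*/; do
  eval_name="$(basename "$dir")"
  new_rate="$(jq -r '.summary.pass_rate' "$dir/new_skill/grading.json")"
  old_rate="$(jq -r '.summary.pass_rate' "$dir/old_skill/grading.json")"
  echo "$eval_name  new_skill=$new_rate  old_skill=$old_rate"
done
```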
Verdict Logic
- improvement: new_skill pass_rate > old_skill pass_rate AND no regressions
- regression: new_skill pass_rate < old_skill pass_rate on any eval
- no_change: identical pass rates
- mixed: improvement on some evals, regression on others
A regression is any assertion that passes with old_skill but fails with new_skill.
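Regressions can be spotted mechanically by lining up the two grading files for an eval. A sketch, assuming `jq` and that assertions appear in the same order in both files:

```bash
# List assertions that pass with old_skill but fail with new_skill
jq -n --slurpfile old old_skill/grading.json --slurpfile new new_skill/grading.json '
  [ range($old[0].expectations | length) as $i
    | select($old[0].expectations[$i].passed and ($new[0].expectations[$i].passed | not))
    | $old[0].expectations[$i].text ]'
```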
Step 7: Launch Report Viewer
Launch the interactive report viewer with feedback collection:
node {this-skill-path}/scripts/generate-report.js {workspace}/iteration-{N} --serve
This starts a local server (default port 3118) that:
- Shows the benchmark results with side-by-side comparison
- Provides per-eval feedback textareas (auto-saves on typing)
- Saves feedback to `{workspace}/iteration-{N}/feedback.json`
- Has a "Done Reviewing" button that signals review completion
For iteration 2+, pass the previous iteration to show prior feedback:
node {this-skill-path}/scripts/generate-report.js {workspace}/iteration-2 --serve --previous {workspace}/iteration-1
Static mode (no server, feedback downloads on submit):
node {this-skill-path}/scripts/generate-report.js {workspace}/iteration-{N}
open {workspace}/iteration-{N}/report.html
Report Contents
The report has two tabs:
- Results — summary table with pass rates, deltas, verdict, notes
- Per-Eval Details — assertion-level comparison with evidence, previous feedback, and feedback textarea per eval
Tell the user: "Review the report in your browser. Leave feedback on any eval, then click 'Done Reviewing' when finished."
Reading Feedback
After the user finishes reviewing, read {workspace}/iteration-{N}/feedback.json:
{
"reviews": [
{ "eval_name": "eval-1", "feedback": "The budget detection was good but missed nested files", "timestamp": "..." },
{ "eval_name": "eval-2", "feedback": "", "timestamp": "..." }
],
"status": "complete"
}
- `status: "complete"` means the user clicked "Done Reviewing"
- Empty `feedback` means the eval looked good (no issues flagged)
- Non-empty `feedback` contains specific issues, corrections, or suggestions
Use this feedback to guide iteration (Step 9).
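A small sketch for pulling out just the flagged evals, assuming `jq`:

```bash
# Confirm the review is finished, then list only the evals with non-empty feedback
jq -e '.status == "complete"' "{workspace}/iteration-{N}/feedback.json" > /dev/null \
  && jq -r '.reviews[] | select(.feedback != "") | "\(.eval_name): \(.feedback)"' \
       "{workspace}/iteration-{N}/feedback.json"
```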
Step 8: Recommend Next Action
Based on the verdict:
Improvement (no regressions)
Your changes improved the skill. New skill scores {X}% vs baseline {Y}% (+{delta}%). No regressions detected. Safe to commit.
Regression detected
Warning: {N} regressions found. These assertions passed with the old skill but fail with the new one:
- [eval-name] "assertion text" — old: PASS, new: FAIL
Consider reverting the change that caused this, or adjusting the new skill to preserve the old behavior.
No change
Your changes didn't measurably affect behavior on these test cases. The skill scores identically ({X}%) in both versions. Consider:
- Are the evals testing the right things?
- Are the changes too subtle to measure with these assertions?
- Are the changes beneficial for maintainability even without behavioral change?
Mixed results
Mixed results: improvement on {N} evals, regression on {M} evals. Review the per-eval breakdown to decide if the trade-off is acceptable.
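The placeholders in these messages ({X}, {Y}, {delta}, regression counts) map directly onto fields in benchmark.json. A sketch for extracting the headline numbers, assuming `jq`:

```bash
# Print verdict, both mean pass rates, and the delta for the recommendation message
jq -r '"\(.summary.verdict): new_skill \(.summary.new_skill.mean_pass_rate) vs old_skill \(.summary.old_skill.mean_pass_rate) (delta \(.summary.delta.pass_rate))"' \
  "{workspace}/iteration-{N}/benchmark.json"
```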
Step 9: Iterate Using Feedback
After the user clicks "Done Reviewing", read {workspace}/iteration-{N}/feedback.json.
Process Feedback
For each eval with non-empty feedback:
- Wrong grades: If the user says a grade is wrong (e.g., "this should be PASS"), update the grading.json and re-generate the benchmark (see the sketch after this list)
- Skill issues: If the user flags something the skill should handle differently, propose a specific change to the skill files
- Eval issues: If the user says an eval is weak or testing the wrong thing, revise the eval in evals.json
- Observations: If the user notes patterns (e.g., "both variants missed X"), consider whether the skill needs a new instruction
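For the wrong-grades case, a minimal sketch of correcting a single assertion in place, assuming `jq`; the grading path and the assertion index are hypothetical:

```bash
# Flip one failed assertion to passed and keep the summary consistent
GRADING="$WORKSPACE/iteration-1/descriptive-kebab-name/new_skill/grading.json"
jq '.expectations[1].passed = true
    | .summary.passed += 1
    | .summary.failed -= 1
    | .summary.pass_rate = (.summary.passed / .summary.total)' \
  "$GRADING" > "$GRADING.tmp" && mv "$GRADING.tmp" "$GRADING"
```

Re-generate benchmark.json (Step 6) afterwards so the aggregate numbers stay consistent with the corrected grading.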
Re-run
After applying changes:
- Make additional changes to the skill
- Re-run from Step 4 into `iteration-{N+1}/`
- Compare against the same baseline (not the previous iteration)
- Launch the viewer with `--previous` to show prior feedback: `node {this-skill-path}/scripts/generate-report.js {workspace}/iteration-{N+1} --serve --previous {workspace}/iteration-{N}`
- The user sees their previous feedback alongside new results and can verify issues were addressed
Workspace Layout
{skill-name}-ab-workspace/
baseline/ # Git HEAD snapshot of the skill
{skill-path}/
SKILL.md
references/
scripts/
evals/
evals.json # Eval definitions
iteration-1/
{eval-name}/
eval_metadata.json
new_skill/
outputs/
response.md
...
grading.json
old_skill/
outputs/
response.md
...
grading.json
benchmark.json
feedback.json # User feedback from reviewer
report.html
report.md
iteration-2/
...
Common Pitfalls
- **Testing non-behavioral changes:** If you only reformatted SKILL.md or merged reference docs, the A/B test may show no delta. That's expected — structural improvements don't always change behavior.
- **Too few evals:** 1-2 evals give noisy results. Aim for 3-5 focused evals that each target a specific change.
- **Overlapping assertions:** If multiple assertions test the same thing in different words, you get inflated scores. Each assertion should test a distinct behavior.
- **Baseline drift:** Always snapshot from git HEAD, not from a previous workspace. If you've made commits between iterations, the baseline changes.
- **Agent variability:** LLMs are non-deterministic. A single run per variant is suggestive, not conclusive. For high-stakes decisions, run 3x per variant and look at the pattern.
- **Forgetting to check regressions:** A +20% improvement means nothing if there's a -50% regression on one eval. Always check per-eval, not just aggregate.