Skill Autoresearch

Use this skill to improve another skill through measured iteration instead of gut feel.

The job is simple: run the target skill on a small test set, score outputs with binary evals, change one thing in the prompt, and keep only mutations that improve the score. Repeat until the score plateaus, the budget cap is hit, or the user stops the loop.

When to use this skill

A skill works inconsistently and needs a repeatable improvement loop
You want to benchmark a SKILL.md before editing it
You need binary evals for prompt or skill quality
You want a mutation log instead of ad-hoc rewriting
You want to compare baseline vs improved prompt behavior

Required inputs

Do not start experiments until all inputs below are known:

Target skill path
Three to five representative test inputs
Three to six binary yes/no evals
Runs per experiment, default 5
Experiment interval, default 2m
Optional budget cap

For writing reliable evals, read references/eval-guide.md.

Instructions

Step 1: Read the target skill

Read the target SKILL.md
Read any directly linked files under that skill's references/
Identify the core job, required steps, output format, and likely failure modes
Note buried instructions or conflicting rules before changing anything

Step 2: Build the eval suite

Convert the user's quality criteria into binary checks only.

Use this format:

EVAL 1: Short name
Question: Yes/no question about the output
Pass: Specific condition that counts as yes
Fail: Specific condition that counts as no

Rules:

Use binary yes/no checks only
Prefer observable checks over taste-based judgments
Keep evals distinct; do not double-count the same failure
Use three to six evals total

Step 3: Create the experiment workspace

Inside the target skill folder, create:

skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline

Requirements:

results.tsv stores experiment summaries
results.json powers the dashboard
dashboard.html is a self-contained status page
SKILL.md.baseline is the untouched original

Step 4: Establish the baseline

Run the target skill as-is before editing it.

Back up the original skill as SKILL.md.baseline
Run the skill N times on the same test inputs
Score every run against every eval
Record experiment 0 as the baseline
If baseline is already above 90 percent, confirm whether more optimization is worth it

Use this results.tsv header:

experiment	score	max_score	pass_rate	status	description

Step 5: Run the mutation loop

This is the core loop:

Inspect the failing outputs
Form one hypothesis about the failure
Make one targeted change to SKILL.md
Re-run the same test set
Score all outputs again
Keep the change only if the score improves
Revert ties or regressions
Append the result to results.tsv, results.json, and changelog.md

Good mutations:

Clarify an ambiguous instruction
Move a critical rule higher
Add one anti-pattern for a recurring failure
Add one focused example
Remove a noisy instruction that causes overfitting

Bad mutations:

Rewrite the whole skill at once
Add many rules in one experiment
Optimize for length instead of behavior
Use intuition instead of measured score

Step 6: Keep the dashboard live

The dashboard should refresh from results.json and show:

Experiment number
Score and pass rate progression
Baseline vs keep vs discard status
Per-eval failure hotspots
Current run state: running, idle, or complete

Use a single self-contained HTML file. Inline CSS/JS is preferred.

Step 7: Log every experiment

Append after every run:

## Experiment N — keep|discard

Score: X/Y
Change: one-sentence mutation summary
Reasoning: why this mutation was tried
Result: what improved or regressed
Remaining failures: what still breaks

Discarded experiments matter. They stop future agents from repeating dead ends.

Step 8: Deliver results

When the loop stops, report:

Baseline score to final score
Number of experiments run
Keep vs discard count
Top changes that helped most
Remaining failure patterns
Artifact locations

Rules

Do not run experiments before inputs and evals are defined
Use the same test set for baseline and mutations
Change one thing at a time
Keep or discard by score, not by preference
Record every attempt
Stop only on manual stop, budget cap, or clear score plateau

Output format

Expected artifacts:

skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline

The improved skill stays in place at its original path.