skill-autoresearch
# Skill Autoresearch
Use this skill when the job is improving a reusable repo artifact with a frozen local benchmark.
The contract is simple: first decide whether a ratchet is justified at all, then freeze the judge, baseline the current artifact, change one meaningful thing only when needed, rerun the same harness, keep only measured improvements, and log both wins and reverts.
This skill is intentionally repo-local. It owns markdown/git-friendly ratchets for SKILL.md, SOPs, prompts, templates, and workflow docs. If the real need is product-scale traces, hosted dashboards, or app/runtime evaluation, route out instead of pretending this skill replaces LangSmith, Braintrust, Weave, or Promptfoo.
Read these support files before editing:
- references/eval-guide.md
- references/loop-charter-template.md
- references/run-packets-and-route-outs.md
## When to use this skill
- A reusable `SKILL.md` or workflow document works inconsistently and needs a bounded improvement loop.
- You need to decide whether the current artifact even deserves another ratchet, or whether `no ratchet justified` is the correct outcome.
- You want to tighten triggers, route-outs, or execution steps without moving the benchmark mid-run.
- You need to add or refresh `references/`, `evals/`, compact variants, or discovery wording only after the main boundary is proven.
- You need append-only keep/revert history that survives in git and PR review.
- You want a repeatable answer to “did this edit actually improve the artifact?”
## Do not use this skill when
- The user already knows the exact rewrite they want and does not need a benchmark loop.
- The target has no representative prompts or no stable way to evaluate behavior.
- The work is really about running GPU-bound `karpathy/autoresearch` experiments on `train.py` / `program.md` / `val_bpb` → use `autoresearch`.
- The work is really about hosted prompt or app evaluation, production traces, large datasets, or experiment dashboards → route to LangSmith, Promptfoo, Braintrust, or Weave.
- You are about to change the artifact and the evaluator at the same time.
## Required inputs
Do not start mutation work until you know the target artifact, 3-5 representative prompts or scenarios, 3-6 binary evals, a rerun/budget rule, and which supporting files are allowed beyond the primary artifact.
Before that, decide whether there is evidence for one of three outcomes:
- the baseline likely fails and needs a real ratchet,
- the main artifact is fine but support surfaces drifted,
- or no ratchet is justified yet.
## Instructions
### Step 1: Choose one packet
Normalize the request into one primary packet before editing anything.
```yaml
skill_autoresearch_packet:
  primary_packet: ratchet-eligibility | benchmark-readiness | charter-freeze | baseline-score | one-change-mutation | support-sync | final-report | route-out
  target_artifact: SKILL.md | SOP | prompt-template | workflow-doc | other
  evidence_shape: prompts-and-evals | dry-run-checklist | repo-validators | mixed | unknown
  support_scope: none | references-only | evals-only | compact-only | discovery-surfaces | mixed
  confidence: high | medium | low
```
Packet meanings:
- `ratchet-eligibility` — decide whether the run should stop as `no ratchet justified`, jump to `support-sync`, or continue into a real benchmark loop
- `benchmark-readiness` — the loop cannot start until prompts/evals/scope are frozen
- `charter-freeze` — write the loop contract before any mutation
- `baseline-score` — snapshot the current artifact and record experiment `0`
- `one-change-mutation` — make exactly one meaningful change, rerun, keep or revert
- `support-sync` — update compact/docs/manifests only after the core ratchet is justified or when the main artifact is already good and only support surfaces drifted
- `final-report` — summarize baseline → final delta, keep/revert count, and remaining failures
- `route-out` — the request actually belongs to hosted eval tooling or ML `autoresearch`
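A minimal example of a filled-in packet, assuming the target is a stale `SKILL.md` whose triggers have drifted; the values are illustrative:
```yaml
skill_autoresearch_packet:
  primary_packet: one-change-mutation
  target_artifact: SKILL.md
  evidence_shape: prompts-and-evals
  support_scope: references-only
  confidence: medium
```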
### Step 2: Read the target and neighboring surfaces
Read the primary artifact first, then only the support surfaces that matter:
- linked `references/`, `evals/`, `scripts/`, or compact variants
- README/setup/manifest wording if discoverability may change
- prior loop artifacts if a ratchet already exists
Capture:
- the artifact's real job
- what should trigger it
- what it should route out
- current failure modes
- any stale discovery wording or support drift
- whether the likely next state is `no ratchet justified`, `support-sync`, or a real mutation loop
### Step 3: Decide ratchet eligibility before freezing the loop
Before writing a charter, answer three things:
- Does the baseline already appear to satisfy the current bar?
- Is the real problem only support-surface drift?
- Is there concrete evidence that a ratchet is still worth the churn?
If the answers point to `no ratchet justified`, stop and report that outcome.
If the main artifact is already good but docs/manifests/compact surfaces drifted, route to `support-sync` instead of pretending a mutation loop happened.
Only continue into the benchmark loop when the baseline genuinely fails or when there is explicit evidence-backed headroom worth pursuing.
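One way to note that call before writing any charter, sketched as yaml with illustrative field names (this is not part of the packet schema):
```yaml
ratchet_eligibility:
  baseline_appears_to_pass: false
  problem_is_only_support_drift: false
  evidence_for_headroom: "reviewers report the skill triggers on unrelated refactor requests"
  decision: continue-to-benchmark-loop   # or no-ratchet-justified | support-sync
```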
### Step 4: Freeze the evaluator
Before editing, write `loop-charter.md`.
The charter must freeze:
- goal of the run
- current baseline
- one primary mutable artifact
- fixed evaluation harness
- supporting files allowed in scope
- time / iteration / tool budget
- rejected directions for this run
Rules:
- do not change prompts, eval wording, or scoring rules mid-run
- if the evaluator changes, start a new comparison track
- keep one primary mutable artifact even if supporting files change later
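The charter itself is a markdown file; the yaml below only sketches the fields it should freeze, with illustrative names, numbers, and budgets:
```yaml
goal: stop the skill from triggering on unrelated refactor requests
baseline: SKILL.md.baseline, scored 3/6 on the frozen harness
mutable_artifact: SKILL.md
harness: 4 representative prompts, 6 binary evals, same wording every run
support_scope: references-only
budget: at most 5 mutations or 2 hours, whichever comes first
rejected_directions:
  - full rewrite of the execution steps
  - new tooling dependencies
```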
### Step 5: Build binary evals only
Strong eval categories for skill and workflow artifacts:
- trigger precision
- route-out clarity
- execution determinism
- artifact usefulness
- benchmark discipline
- discovery-surface sync (only if the run changes positioning)
Rules:
- use yes/no checks only
- prefer observable checks over taste
- do not score the same failure twice
- use the same prompt set for baseline and mutation runs
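A minimal sketch of frozen binary checks, assuming the trigger-precision, route-out, and determinism categories above; ids, prompts, and wording are illustrative, and each check is answered yes/no on the same prompt set every run:
```yaml
evals:
  - id: trigger-precision-1
    check: "Given the prompt 'refactor this Python module', the skill declines to trigger"
  - id: route-out-clarity-1
    check: "Given a request for hosted trace dashboards, the skill names LangSmith, Braintrust, Weave, or Promptfoo"
  - id: execution-determinism-1
    check: "Two readers following Step 1 land on the same primary packet for the same request"
```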
### Step 6: Baseline the current artifact
- copy the current version to a baseline artifact such as `SKILL.md.baseline`
- record experiment `0`
- score the current version on the frozen harness
- summarize the failures before mutating anything
Decision gate:
- If the baseline already passes comfortably and there is no material support drift, stop and report `no ratchet justified`.
- If the baseline is good and only compact/docs/manifests drifted, jump to `support-sync` and say so explicitly.
- Only continue to mutation when the baseline actually fails or the charter names clear evidence-backed headroom worth pursuing.
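One way to record experiment 0 once the baseline is scored, assuming the illustrative six-check harness sketched above:
```yaml
experiment: 0
artifact: SKILL.md.baseline
score: 3/6
failing_evals: [trigger-precision-1, route-out-clarity-1, execution-determinism-1]
mutation: none, baseline snapshot only
```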
### Step 7: Run one-change mutations
This is the core loop:
- inspect the failing outputs or artifact surfaces
- form one hypothesis
- change one meaningful thing only
- rerun the same harness
- compare against the baseline and current best
- keep only score-improving changes
- revert ties or regressions unless the charter explicitly prefers a secondary metric such as lower ambiguity or smaller front-door size
- append the outcome to the run log
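A sketch of how a short run could read in the results table, continuing the illustrative harness above; the tie at experiment 2 is reverted because the charter named no secondary metric:

| experiment | mutation | score | decision |
| --- | --- | --- | --- |
| 0 | baseline snapshot | 3/6 | - |
| 1 | tightened the trigger description | 4/6 | keep |
| 2 | added a second compact variant | 4/6 | revert (tie) |
| 3 | moved the hosted-eval route-out above the trigger list | 5/6 | keep |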
Good mutations:
- tighten a weak trigger description
- move a critical boundary rule higher
- add one focused support file that closes a clear usability gap
- remove a noisy instruction that causes over-triggering
- shrink a bloated front door by moving stable detail into a reference
Bad mutations:
- rewriting the skill and evaluator together
- broad multi-file churn before the core boundary is proven
- keyword stuffing for recall
- optimizing for style instead of measured behavior
### Step 8: Sync support surfaces only after the ratchet holds
If the main change materially affects discoverability, onboarding, naming, or usage, then sync:
- `evals/evals.json`
- compact variants such as `SKILL.toon`
- discovery manifests such as `skills.json` / compact indexes
- docs/setup surfaces such as `README.md`, localized README entries, or setup prompts
Do this after the main artifact improvement is justified by the frozen evaluator.
### Step 9: Log every experiment
Use append-only artifacts. Minimum package:
- `loop-charter.md`
- baseline copy
- structured results (`results.tsv`, `results.json`, or equivalent)
- plain-language changelog / experiment log
Every experiment should record:
- keep or revert
- score delta
- one-sentence mutation summary
- hypothesis
- remaining failures
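A minimal sketch of one changelog entry, continuing the same illustrative run:
```yaml
experiment: 3
hypothesis: "over-triggering comes from the route-out rule sitting below the trigger list"
mutation: "moved the hosted-eval route-out above the trigger list"
score_delta: +1
decision: keep
remaining_failures: [execution-determinism-1]
```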
### Step 10: Apply the rule to this skill too
If the target artifact is skill-autoresearch itself, do not exempt it from the loop.
Validate:
- the target skill against the frozen rubric
- support-surface sync only after the main ratchet holds
- the final wording still preserves frozen evaluators, one-change iterations, append-only logs, and explicit keep/revert decisions
## Output format
Return a compact ratchet report:
```markdown
# Skill Autoresearch Report
## Packet
- Primary packet:
- Target artifact:
- Why this packet fits:
## Frozen harness
- Prompts / scenarios:
- Binary evals:
- Validators:
- Scope limits:
## Baseline
- Current score:
- Main failures:
## Mutation result
- Change tried:
- Keep or revert:
- Score delta:
- Remaining failures:
## Support sync
- Updated surfaces:
- Deferred surfaces:
## Next state
- Recommended next move:
- Artifact paths:
```
## Examples
### Example 1: Trigger repair
**Input**
Improve this stale skill. It over-triggers and I want a bounded loop, not a rewrite.
**Good direction**
- packet: `one-change-mutation`
- freeze prompts/evals first
- rewrite description or route-out wording only
- keep or revert by measured result
### Example 2: Benchmark drift repair
**Input**
I keep changing the tests while rewriting the skill. Help me optimize it anyway.
**Good direction**
- packet: `benchmark-readiness` or `charter-freeze`
- stop the run from proceeding until the evaluator is frozen
- instruct the maintainer to start a new comparison track if the judge must change
### Example 3: Discovery-surface follow-up
**Input**
The skill is better now. README, setup prompt, and compact wording may be stale.
**Good direction**
- packet: `support-sync`
- only after the main ratchet is proven, or because the main artifact is already fine and only support surfaces drifted
- sync manifests/docs without pretending that broad doc churn was the main experiment
### Example 4: No-ratchet outcome
**Input**
The current skill already passes our frozen checks and reviewer feedback is good. I just feel like tuning it more.
**Good direction**
- packet: `ratchet-eligibility`
- conclude `no ratchet justified` unless the maintainer can name concrete evidence-backed headroom
- protect the passing baseline from churn
### Example 5: Route-out
**Input**
I need online traces, dataset comparisons, and hosted dashboards for our app prompts.
**Good direction**
- packet: `route-out`
- explain that the job belongs to LangSmith, Promptfoo, Braintrust, or Weave rather than repo-local skill ratcheting
## Best practices
- Make `no ratchet justified` a valid success state instead of assuming every run needs mutation.
- Freeze the evaluator before the first real edit.
- Keep one primary mutable artifact even when support files are in scope.
- Change one meaningful thing at a time.
- Prefer representative prompts over perfect toy cases.
- Log failed experiments, not just successes.
- Treat smaller front doors as a valid secondary win only when the same harness still passes.
- Sync compact/docs/manifests only after the main ratchet holds.
- Keep the boundary sharp between repo-local skill maintenance, hosted eval platforms, and ML `autoresearch`.