Autoresearch for Skills

Most skills work about 70% of the time. The other 30% you get garbage. The fix isn't to rewrite the skill from scratch. It's to let an agent run it dozens of times, score every output, and tighten the prompt until that 30% disappears.

This skill adapts Andrej Karpathy's autoresearch methodology (autonomous experimentation loops) to Claude Code skills. Instead of optimizing ML training code, we optimize skill prompts.

the core job

Take any existing skill, define what "good output" looks like as binary yes/no checks, then run an autonomous loop that:

Generates outputs from the skill using test inputs
Scores every output against the eval criteria
Mutates the skill prompt to fix failures
Keeps mutations that improve the score, discards the rest
Repeats until the score ceiling is hit or the user stops it

Output: An improved SKILL.md + results.tsv log + changelog.md of every mutation attempted + a live HTML dashboard you can watch in your browser.