# Autoresearch

Scaffold and run autonomous code improvement loops in any git repo. The pattern: generate a hypothesis via `claude -p`, implement it, run programmatic eval gates, keep if the composite score improves, discard if it doesn't. Proven across 50+ iterations on two codebases (shadow-engine: 0.69 to 1.0, perplexity-clone: search quality optimization).

## Five Invariants (never violate)

- Single mutable surface — one hypothesis per iteration, one change per experiment
- Fixed eval budget — eval runs in bounded time, no network calls in gates
- One scalar metric — composite score drives keep/discard, not vibes
- Binary keep/discard — improved = keep, else revert via `git reset --hard HEAD~1`
- Git-as-memory — every experiment is a commit, discards are reverts, history is the log

## Safety Rules

- Never modify `.lab/` contents during hypothesis implementation
- Never skip eval — every commit must be evaluated before keep/discard
- Always revert on crash — an `atexit` handler restores git state
- Runner uses subscription auth (`claude -p` with `ANTHROPIC_API_KEY` stripped)
## Category

Runbooks — mechanical process with clear steps, not cognitive reasoning.
## Quick Start

```
/autoresearch init     # scaffold .lab/ in your repo
/autoresearch run      # start the loop (default: 50 iterations)
/autoresearch status   # check progress
/autoresearch resume   # recover an interrupted run
```
## Command Dispatch

Parse `$ARGUMENTS` and route:

| Argument | Action |
|---|---|
| `init` | Run scaffold workflow (see Init below) |
| `eval-gen` | Regenerate eval gates from repo analysis |
| `run [--max-iterations N] [--dry-run]` | Launch the autoresearch loop |
| `status` | Show composite, timeline, convergence signals |
| `resume` | Detect `.lab/`, present state, ask resume or fresh |
| (empty) | Show help text with available commands |
## Init Workflow (`/autoresearch init`)

- Verify `.git/` exists in the current directory
- Run stack detection: `python3 ~/.claude/skills/autoresearch/scripts/detect_stack.py`
- Review the detected stack info (language, build_cmd, test_cmd, lint_cmd)
- Run the scaffold script: `python3 ~/.claude/skills/autoresearch/scripts/scaffold.py --repo-root . --yes`
- Review `.lab/config.json` — adjust `keep_threshold`, `max_iterations`, and `gate_weights` if needed (see the sketch after this workflow)
- Edit `.lab/program.md` — this is the most important file. Add:
  - Specific areas to improve (not vague goals)
  - A concrete hypothesis list (ranked)
  - Constraints the agent must respect
- Run a baseline eval to verify the gates work: `python3 .lab/eval.py`
- Report the initial composite to the user

If `.lab/` already exists, ask the user: resume the existing lab, or archive it to `.lab.bak.<timestamp>/` and start fresh?
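For orientation, here is a hypothetical `.lab/config.json` sketch. Only the field names called out above (`repo_name`, `build_cmd`, `test_cmd`, `lint_cmd`, `keep_threshold`, `max_iterations`, `gate_weights`) come from this document; the tier-keyed shape of `gate_weights` and every value are placeholder assumptions — the authoritative schema is `assets/config.json.tmpl`:

```json
{
  "repo_name": "my-repo",
  "build_cmd": "make build",
  "test_cmd": "make test",
  "lint_cmd": "make lint",
  "keep_threshold": 0.005,
  "max_iterations": 50,
  "gate_weights": { "T1": 0.20, "T2": 0.40, "T3": 0.25, "T4": 0.15 }
}
```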
## Eval-Gen Workflow (`/autoresearch eval-gen`)

Regenerate eval gates without re-scaffolding everything:

```
python3 ~/.claude/skills/autoresearch/scripts/eval_gen.py --repo-root . --output .lab/eval.py
```

Review the generated gates. The user may want to:

- Add custom gates for domain-specific behavior
- Adjust tier weights in `.lab/config.json`
- Add behavioral gates that test specific CLI invocations or API endpoints
Gates follow a 4-tier architecture:
| Tier | Weight | What it measures | Anti-cheat |
|---|---|---|---|
| T1: Build+Test | 0.20 | Compiles, tests pass, lint clean | Runs real commands, sums pass counts |
| T2: Behavioral | 0.40 | Integration tests, CLI output, API responses | Validates content, not file existence |
| T3: Pipeline | 0.25 | Build artifacts, installs, real I/O | File size >1KB, header validation |
| T4: Documentation | 0.15 | Test count floor, doc coverage | Counts code, never trusts comments |
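To make the scoring concrete, here is a minimal sketch of a weighted composite over tier scores, assuming each tier yields a 0–1 score and that crashed gates are excluded (per the Gotchas below). Renormalizing after dropping crashed tiers is an assumption of this sketch; the actual logic lives in `.lab/eval_base.py` and may differ:

```python
def composite(tier_scores: dict[str, float],
              weights: dict[str, float],
              crashed: set[str]) -> float:
    """Weighted mean over tiers, renormalized after excluding crashed tiers."""
    live = {t: s for t, s in tier_scores.items() if t not in crashed}
    total_weight = sum(weights[t] for t in live)
    if total_weight == 0:
        return 0.0
    return sum(weights[t] * s for t, s in live.items()) / total_weight

# Tier weights from the table above; scores are illustrative
weights = {"T1": 0.20, "T2": 0.40, "T3": 0.25, "T4": 0.15}
scores = {"T1": 1.0, "T2": 0.75, "T3": 0.9, "T4": 1.0}
print(composite(scores, weights, crashed=set()))  # 0.875
```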
## Run Workflow (`/autoresearch run`)

```
python3 .lab/runner.py --max-iterations 50
```

Or for a dry run (prints hypothesis, creates no files):

```
python3 .lab/runner.py --dry-run --max-iterations 1
```

Monitor progress in a separate terminal:

```
tail -f .lab/results.tsv
```

The runner:

- Loads config from `.lab/config.json`
- Reads `program.md` for constraints and hypothesis direction
- Creates an `autoresearch/{date}` branch
- Loops: hypothesis via `claude -p` -> implement via `claude -p` -> git commit -> eval -> keep/discard (sketched at the end of this section)
- Logs every experiment to `.lab/results.tsv` with extended statuses:
| Status | Meaning |
|---|---|
| `KEEP` | Composite improved >= keep_threshold |
| `KEEP*` | Primary improved but a secondary metric regressed |
| `DISCARD` | No improvement, reverted |
| `INTERESTING` | Negative result that reveals structure, logged to dead-ends |
| `CRASH` | Eval infrastructure failure, reverted |
| `TIMEOUT` | Experiment exceeded timeout, logged as crash |
- Checks 9 convergence signals after each experiment (see `references/convergence-signals.md`)
- Re-validates the baseline every 10 real experiments
- Auto-generates `.lab/eval-report.md` with cumulative progress
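A minimal sketch of the loop's core keep/discard mechanic. The real runner is generated from `scripts/runner_template.py` and parses the three-tier stderr protocol described below; here we assume, hypothetically, that `eval.py` prints the composite as its last stdout line:

```python
import subprocess

def run_eval() -> float:
    """Run the eval gates; assume (hypothetically) the composite is stdout's last line."""
    out = subprocess.run(["python3", ".lab/eval.py"],
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip().splitlines()[-1])

def experiment(description: str, best: float, keep_threshold: float) -> float:
    # Commit the implemented hypothesis so eval runs against a fixed tree
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", description], check=True)
    score = run_eval()
    if score - best >= keep_threshold:
        return score                                                  # KEEP
    subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)  # DISCARD
    return best
```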
## Status Workflow (`/autoresearch status`)

```
python3 ~/.claude/skills/autoresearch/scripts/report.py --repo-root .
```

Shows: composite (live), experiment timeline, keeps/discards/crashes, active convergence signals, branch genealogy, dead-ends.
## Resume Workflow (`/autoresearch resume`)

- Check if `.lab/` exists
- If yes: read `config.json` and `results.tsv`, tail `log.md`
- Present a summary: objective, metrics, experiment count, current best vs baseline, last status
- Ask: resume (continue from the last experiment) or fresh (archive to `.lab.bak.<timestamp>/`)
- If resuming: check for a stale lock file, clean it up if needed (see the sketch below), then run
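A sketch of the stale-lock check, assuming `.lab/.runner.lock` contains only the owning PID (the file's actual format comes from the runner template). On POSIX, `os.kill(pid, 0)` probes for a live process without sending a signal:

```python
import os
from pathlib import Path

LOCK = Path(".lab/.runner.lock")

def clear_stale_lock() -> None:
    """Remove .lab/.runner.lock if its PID no longer refers to a live process."""
    if not LOCK.exists():
        return
    try:
        pid = int(LOCK.read_text().strip())
        os.kill(pid, 0)          # raises if no process with this PID exists
    except (ValueError, ProcessLookupError):
        LOCK.unlink()            # stale or malformed lock: safe to remove
    except PermissionError:
        pass                     # live process owned by another user: leave it
```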
## `.lab/` Directory Layout

```
.lab/                # gitignored — experiment knowledge store
  config.json        # All parameters (repo_name, build_cmd, keep_threshold, etc.)
  runner.py          # Customized runner (from runner_template.py)
  eval.py            # Generated + user-extended eval gates
  eval_base.py       # Base framework (gate registration, composite scoring)
  program.md         # Human-maintained constraints + priorities
  results.tsv        # Experiment log (experiment_id, branch, parent, commit,
                     #   composite, status, duration_s, description)
  log.md             # Narrative per-experiment entries
  branches.md        # Branch registry
  dead-ends.md       # Falsified approaches + why they failed
  parking-lot.md     # Deferred ideas for later
  eval-report.md     # Auto-generated cumulative report
  runner-*.log       # Runner stdout/stderr logs
  .runner.lock       # PID lock file (prevents concurrent runs)
```

Why `.lab/`, not `autoresearch/`: code state (git) and experiment knowledge (`.lab/`) are fully decoupled. `git reset --hard HEAD~1` (the core discard mechanic) never touches `.lab/`. Results survive branch operations.
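For orientation, a hypothetical `results.tsv` entry, assuming one tab-separated row per experiment in the column order listed above (all values are invented; the runner's exact formatting may differ):

```
experiment_id  branch                parent  commit   composite  status  duration_s  description
007            autoresearch/2025-01  005     a1b2c3d  0.84       KEEP    312         Cache parsed AST between gates
```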
## Three-Tier Output Protocol

Eval gates emit structured diagnostics to stderr:

```
GATE build=PASS                 # Binary — blocks iteration on FAIL
METRIC test_count=475           # Continuous — tracked in results.tsv
TRACE gate_duration_ms=3200     # Execution data — for debugging only
```
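A sketch of a custom behavioral gate emitting these lines. The `gate` decorator and the `mytool` CLI are hypothetical stand-ins — `.lab/eval_base.py`'s real registration API may differ, so check the generated `eval.py` for the actual pattern:

```python
import subprocess
import sys
import time

# Hypothetical decorator — eval_base.py's real registration API may differ.
def gate(tier: str, weight: float):
    def wrap(fn):
        fn.tier, fn.weight = tier, weight
        return fn
    return wrap

@gate(tier="T2", weight=0.40)
def cli_help_gate() -> bool:
    """Behavioral gate: validate real CLI output, not file existence."""
    start = time.monotonic()
    out = subprocess.run(["python3", "-m", "mytool", "--help"],  # hypothetical CLI
                         capture_output=True, text=True)
    passed = out.returncode == 0 and "usage:" in out.stdout.lower()
    print(f"GATE cli_help={'PASS' if passed else 'FAIL'}", file=sys.stderr)
    print(f"METRIC help_bytes={len(out.stdout)}", file=sys.stderr)
    print(f"TRACE gate_duration_ms={int((time.monotonic() - start) * 1000)}",
          file=sys.stderr)
    return passed
```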
## Scripts Reference

| Script | Purpose | Run from |
|---|---|---|
| `scripts/detect_stack.py` | Detect language, build system, test runner | Skill dir |
| `scripts/scaffold.py` | Create `.lab/` with all files | Skill dir |
| `scripts/eval_gen.py` | Generate adversarial eval gates | Skill dir |
| `scripts/report.py` | Render status report | Skill dir |
| `scripts/runner_template.py` | Template copied to `.lab/runner.py` | Skill dir |
| `assets/eval_base.py` | Base eval framework copied to `.lab/` | Skill dir |
| `assets/config.json.tmpl` | Config template with documented fields | Skill dir |
| `assets/program.md.tmpl` | `program.md` template | Skill dir |

All scripts run with `python3` (no special dependencies). Use `uv run` if preferred.
Gotchas
ANTHROPIC_API_KEYin environment: The runner strips it soclaude -puses subscription auth (not pay-per-use API). If you want API auth, setuse_api_key: truein config.json.- Gate stochasticity: If gates produce different scores on the same code, the runner will thrash between keep/discard. All gates must be deterministic.
- Large dt on resume: If the machine suspends during a run, the runner handles it gracefully via atexit + lock file cleanup.
- Eval crashes vs gate crashes: An eval crash (eval.py itself fails) aborts the iteration. A gate crash (one gate throws) is logged in
crashed_gatesand excluded from composite.
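A minimal sketch of the key-stripping behavior, assuming a `use_api_key` flag as described above; the generated runner's actual invocation may differ:

```python
import os
import subprocess

def call_claude(prompt: str, use_api_key: bool = False) -> str:
    """Invoke `claude -p`; strip ANTHROPIC_API_KEY so subscription auth is used."""
    env = dict(os.environ)
    if not use_api_key:
        env.pop("ANTHROPIC_API_KEY", None)   # force subscription auth
    out = subprocess.run(["claude", "-p", prompt],
                         capture_output=True, text=True, env=env, check=True)
    return out.stdout
```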
## Post-Run Checklist

After every autoresearch run:

- `tail -f .lab/results.tsv` — review keeps and discards
- Read `.lab/eval-report.md` for cumulative progress and ceiling detection
- Merge the autoresearch branch to main if satisfied
- Update the `.lab/program.md` dead ends with falsified approaches
- Run `python3 .lab/eval.py` to confirm the final composite
## Never

- Never modify `.lab/eval_base.py` or `.lab/runner.py` during a run
- Never run two runners concurrently (the lock file prevents this, but don't bypass it)
- Never commit `.lab/` to git (it's gitignored for a reason)
- Never trust a composite that includes crashed gates