autoresearch

Five Invariants (never violate)

  1. Single mutable surface — one hypothesis per iteration, one change per experiment
  2. Fixed eval budget — eval runs in bounded time, no network calls in gates
  3. One scalar metric — composite score drives keep/discard, not vibes
  4. Binary keep/discard — improved = keep, else revert via git reset --hard HEAD~1
  5. Git-as-memory — every experiment is a commit, discards are reverts, history is the log

Safety rules

  • Never modify .lab/ contents during hypothesis implementation
  • Never skip eval — every commit must be evaluated before keep/discard
  • Always revert on crash — atexit handler restores git state
  • Runner uses subscription auth (claude -p with ANTHROPIC_API_KEY stripped)

Autoresearch

Scaffold and run autonomous code improvement loops in any git repo. The pattern: generate a hypothesis via claude -p, implement it, run programmatic eval gates, keep the change if the composite score improves, and discard it if it doesn't. Proven across 50+ iterations on two codebases (shadow-engine: composite 0.69 to 1.0; perplexity-clone: search quality optimization).

Category

Runbooks — mechanical process with clear steps, not cognitive reasoning.

Quick Start

/autoresearch init          # scaffold .lab/ in your repo
/autoresearch run           # start the loop (default: 50 iterations)
/autoresearch status        # check progress
/autoresearch resume        # recover interrupted run

Command Dispatch

Parse $ARGUMENTS and route:

Argument                              Action
init                                  Run scaffold workflow (see Init below)
eval-gen                              Regenerate eval gates from repo analysis
run [--max-iterations N] [--dry-run]  Launch the autoresearch loop
status                                Show composite, timeline, convergence signals
resume                                Detect .lab/, present state, ask resume or fresh
(empty)                               Show help text with available commands
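
A minimal routing sketch, assuming $ARGUMENTS arrives as a whitespace-split token list; the print placeholders stand in for the real workflows and are not the skill's actual implementation:

import sys

HELP = "usage: /autoresearch [init|eval-gen|run [--max-iterations N] [--dry-run]|status|resume]"

def dispatch(argv: list[str]) -> None:
    # First token selects the workflow; remaining tokens pass through as flags.
    routes = {
        "init": lambda flags: print("scaffold .lab/"),
        "eval-gen": lambda flags: print("regenerate eval gates"),
        "run": lambda flags: print("launch loop with", flags),
        "status": lambda flags: print("render status report"),
        "resume": lambda flags: print("recover interrupted run"),
    }
    if not argv:
        print(HELP)                  # (empty) -> help text
        return
    handler = routes.get(argv[0])
    if handler is None:
        sys.exit(f"unknown command: {argv[0]}\n{HELP}")
    handler(argv[1:])

if __name__ == "__main__":
    dispatch(sys.argv[1:])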

Init Workflow (/autoresearch init)

  1. Verify .git/ exists in current directory
  2. Run stack detection:
    python3 ~/.claude/skills/autoresearch/scripts/detect_stack.py
    
  3. Review the detected stack info (language, build_cmd, test_cmd, lint_cmd)
  4. Run the scaffold script:
    python3 ~/.claude/skills/autoresearch/scripts/scaffold.py --repo-root . --yes
    
  5. Review .lab/config.json — adjust keep_threshold, max_iterations, and gate_weights if needed (a sample config follows this workflow)
  6. Edit .lab/program.md — this is the most important file. Add:
    • Specific areas to improve (not vague goals)
    • Concrete hypothesis list (ranked)
    • Constraints the agent must respect
  7. Run baseline eval to verify gates work:
    python3 .lab/eval.py
    
  8. Report the initial composite to the user

If .lab/ already exists, ask the user: resume existing lab, or archive to .lab.bak.<timestamp>/ and start fresh?
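
For orientation, a hypothetical .lab/config.json. Only the fields this document names (repo_name, build_cmd, test_cmd, lint_cmd, keep_threshold, max_iterations, gate_weights, use_api_key) are known to exist; every value and every gate_weights key below is illustrative, and the weights simply mirror the tier table in the next section:

{
  "repo_name": "my-project",
  "build_cmd": "make build",
  "test_cmd": "make test",
  "lint_cmd": "make lint",
  "keep_threshold": 0.005,
  "max_iterations": 50,
  "use_api_key": false,
  "gate_weights": {"t1": 0.20, "t2": 0.40, "t3": 0.25, "t4": 0.15}
}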

Eval-Gen Workflow (/autoresearch eval-gen)

Regenerate eval gates without re-scaffolding everything:

python3 ~/.claude/skills/autoresearch/scripts/eval_gen.py --repo-root . --output .lab/eval.py

Review the generated gates. The user may want to:

  • Add custom gates for domain-specific behavior (see the sketch after the tier table)
  • Adjust tier weights in .lab/config.json
  • Add behavioral gates that test specific CLI invocations or API endpoints

Gates follow a 4-tier architecture:

Tier               Weight  What it measures                              Anti-cheat
T1: Build+Test     0.20    Compiles, tests pass, lint clean              Runs real commands, sums pass counts
T2: Behavioral     0.40    Integration tests, CLI output, API responses  Validates content, not file existence
T3: Pipeline       0.25    Build artifacts, installs, real I/O           File size >1KB, header validation
T4: Documentation  0.15    Test count floor, doc coverage                Counts code, never trusts comments
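
As referenced above, a sketch of a hand-written T2 behavioral gate. This document does not show how gates register with .lab/eval_base.py, so the standalone function below only illustrates the shape of a content-validating gate; the CLI under test is hypothetical, and you would wire the function into whatever registration API eval_base.py actually exposes:

import subprocess
import sys

def gate_cli_help() -> bool:
    """T2 behavioral gate: the CLI must emit real usage text, not merely exist."""
    proc = subprocess.run(
        ["python3", "-m", "mytool", "--help"],   # hypothetical CLI under test
        capture_output=True, text=True, timeout=60,
    )
    ok = proc.returncode == 0 and "usage:" in proc.stdout.lower()
    # Anti-cheat: validate output content, never just an exit code or a file's existence.
    print(f"GATE cli_help={'PASS' if ok else 'FAIL'}", file=sys.stderr)
    return ok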

Run Workflow (/autoresearch run)

python3 .lab/runner.py --max-iterations 50

Or for a dry run (prints hypothesis, creates no files):

python3 .lab/runner.py --dry-run --max-iterations 1

Monitor progress in a separate terminal:

tail -f .lab/results.tsv

The runner:

  1. Loads config from .lab/config.json
  2. Reads program.md for constraints and hypothesis direction
  3. Creates an autoresearch/{date} branch
  4. Loops: hypothesis via claude -p -> implement via claude -p -> git commit -> eval -> keep/discard (sketched after this list)
  5. Logs every experiment to .lab/results.tsv with extended statuses:
     Status       Meaning
     KEEP         Composite improved by >= keep_threshold
     KEEP*        Primary metric improved but a secondary metric regressed
     DISCARD      No improvement; reverted
     INTERESTING  Negative result that reveals structure; logged to dead-ends
     CRASH        Eval infrastructure failure; reverted
     TIMEOUT      Experiment exceeded its timeout; logged as a crash
  6. Checks 9 convergence signals after each experiment (see references/convergence-signals.md)
  7. Re-validates baseline every 10 real experiments
  8. Auto-generates .lab/eval-report.md with cumulative progress
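
The loop in step 4, condensed to an illustrative sketch. The prompt text, the composite-parsing convention, and both helper functions are assumptions for illustration, not the template's actual code:

import os
import subprocess

def claude_p(prompt: str) -> str:
    # Subscription auth: strip ANTHROPIC_API_KEY before invoking claude -p.
    env = {k: v for k, v in os.environ.items() if k != "ANTHROPIC_API_KEY"}
    return subprocess.run(["claude", "-p", prompt], capture_output=True,
                          text=True, env=env, check=True).stdout

def run_eval() -> float:
    proc = subprocess.run(["python3", ".lab/eval.py"],
                          capture_output=True, text=True)
    return float(proc.stdout.strip().splitlines()[-1])  # assumed: composite on last stdout line

def iterate(best: float, keep_threshold: float) -> float:
    hypothesis = claude_p("Propose ONE change per .lab/program.md constraints")
    claude_p(f"Implement exactly this change, nothing else: {hypothesis}")
    subprocess.run(["git", "add", "-A"], check=True)     # .lab/ is gitignored, so it stays out
    subprocess.run(["git", "commit", "-m", hypothesis[:72]], check=True)
    composite = run_eval()
    if composite - best >= keep_threshold:
        return composite                                 # KEEP: the commit stays
    subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)
    return best                                          # DISCARD: revert code, .lab/ untouched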

Status Workflow (/autoresearch status)

python3 ~/.claude/skills/autoresearch/scripts/report.py --repo-root .

Shows: composite (live), experiment timeline, keeps/discards/crashes, active convergence signals, branch genealogy, dead-ends.

Resume Workflow (/autoresearch resume)

  1. Check if .lab/ exists
  2. If yes: read config.json, results.tsv, tail of log.md
  3. Present summary: objective, metrics, experiment count, current best vs baseline, last status
  4. Ask: resume (continue from last experiment) or fresh (archive .lab.bak.<timestamp>/)
  5. If resume: check for stale lock file, clean up if needed, then run
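
A sketch of the stale-lock cleanup in step 5, assuming .runner.lock stores a bare PID (the actual lock format may differ):

import os
from pathlib import Path

def clear_stale_lock(lock: Path = Path(".lab/.runner.lock")) -> None:
    if not lock.exists():
        return
    try:
        pid = int(lock.read_text().strip())
    except ValueError:
        lock.unlink()                # unreadable contents: treat as stale
        return
    try:
        os.kill(pid, 0)              # signal 0 probes liveness without delivering anything
    except ProcessLookupError:
        lock.unlink()                # owner is gone: safe to clean up and resume
    else:
        raise SystemExit(f"runner still active (pid {pid}); refusing to resume")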

.lab/ Directory Layout

.lab/                          # gitignored — experiment knowledge store
  config.json                # All parameters (repo_name, build_cmd, keep_threshold, etc.)
  runner.py                  # Customized runner (from runner_template.py)
  eval.py                    # Generated + user-extended eval gates
  eval_base.py               # Base framework (gate registration, composite scoring)
  program.md                 # Human-maintained constraints + priorities
  results.tsv                # Experiment log (experiment_id, branch, parent, commit,
                             #   composite, status, duration_s, description)
  log.md                     # Narrative per-experiment entries
  branches.md                # Branch registry
  dead-ends.md               # Falsified approaches + why they failed
  parking-lot.md             # Deferred ideas for later
  eval-report.md             # Auto-generated cumulative report
  runner-*.log               # Runner stdout/stderr logs
  .runner.lock               # PID lock file (prevents concurrent runs)

Why .lab/ not autoresearch/: Code state (git) and experiment knowledge (.lab/) are fully decoupled. git reset --hard HEAD~1 (the core discard mechanic) never touches .lab/. Results survive branch operations.

Three-Tier Output Protocol

Eval gates emit structured diagnostics to stderr:

GATE build=PASS              # Binary — blocks iteration on FAIL
METRIC test_count=475        # Continuous — tracked in results.tsv
TRACE gate_duration_ms=3200  # Execution data — for debugging only
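
A minimal gate emitting all three tiers, matching the lines above; the build command and warning-count metric are placeholders, not a prescribed convention:

import subprocess
import sys
import time

def build_gate(build_cmd: str = "make") -> bool:
    start = time.monotonic()
    proc = subprocess.run(build_cmd.split(), capture_output=True, text=True)
    ok = proc.returncode == 0
    print(f"GATE build={'PASS' if ok else 'FAIL'}", file=sys.stderr)  # binary: blocks on FAIL
    print(f"METRIC build_warnings={proc.stderr.count('warning')}", file=sys.stderr)  # continuous
    print(f"TRACE gate_duration_ms={int((time.monotonic() - start) * 1000)}", file=sys.stderr)  # debug only
    return ok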

Scripts Reference

Script                      Purpose                                     Run from
scripts/detect_stack.py     Detect language, build system, test runner  Skill dir
scripts/scaffold.py         Create .lab/ with all files                 Skill dir
scripts/eval_gen.py         Generate adversarial eval gates             Skill dir
scripts/report.py           Render status report                        Skill dir
scripts/runner_template.py  Template copied to .lab/runner.py           Skill dir
assets/eval_base.py         Base eval framework copied to .lab/         Skill dir
assets/config.json.tmpl     Config template with documented fields      Skill dir
assets/program.md.tmpl      Program.md template                         Skill dir

All scripts run with python3 (no special dependencies). Use uv run if preferred.

Gotchas

  • ANTHROPIC_API_KEY in environment: The runner strips it so claude -p uses subscription auth (not pay-per-use API). If you want API auth, set use_api_key: true in config.json.
  • Gate stochasticity: If gates produce different scores on identical code, the runner will thrash between keep and discard. All gates must be deterministic; a quick spot-check is sketched after this list.
  • Large time gaps (dt) on resume: If the machine suspends during a run, the runner recovers gracefully via its atexit handler and lock file cleanup.
  • Eval crashes vs gate crashes: An eval crash (eval.py itself fails) aborts the iteration. A gate crash (one gate throws) is logged in crashed_gates and excluded from composite.
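
A quick determinism spot-check, assuming the structured diagnostics land on stderr as described under the output protocol (requires bash for process substitution):

python3 .lab/eval.py 2> /tmp/eval1.log
python3 .lab/eval.py 2> /tmp/eval2.log
diff <(grep -E '^(GATE|METRIC)' /tmp/eval1.log) <(grep -E '^(GATE|METRIC)' /tmp/eval2.log)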

Post-Run Checklist

After every autoresearch run:

  1. tail .lab/results.tsv — review the final keeps/discards
  2. Read .lab/eval-report.md for cumulative progress and ceiling detection
  3. Merge the autoresearch branch to main if satisfied
  4. Update .lab/program.md dead ends with falsified approaches
  5. Run python3 .lab/eval.py to confirm final composite

Never

  • Never modify .lab/eval_base.py or .lab/runner.py during a run
  • Never run two runners concurrently (lock file prevents this, but don't bypass)
  • Never commit .lab/ to git (it's gitignored for a reason)
  • Never trust a composite that includes crashed gates