autoresearch-fleet
Autoresearch Fleet
Autonomous research loop inspired by karpathy/autoresearch. One mutable file, one immutable eval harness, git as state machine, and a "NEVER STOP" directive. The agent modifies code, evaluates the result, keeps improvements, discards regressions, and repeats indefinitely.
Open-world extension: when the agent plateaus (N consecutive discards), the orchestrator injects a web-search prompt, breaking through knowledge ceilings the LLM can't cross alone.
When to use
- Optimizing a single metric (latency, accuracy, loss, score)
- The problem has a fast, deterministic eval harness
- You want autonomous overnight runs (100+ experiments while you sleep)
- The search space is too large for manual exploration
How it works
```
┌─────────────────────────────────────────────────┐
│ orchestrator.sh                                 │
│                                                 │
│ for each iteration:                             │
│   1. Count trailing discards in results.tsv    │
│   2. If >= plateau_threshold → search prompt   │
│   3. Spawn agent (claude -p or codex exec)     │
│   4. Agent reads program.md, edits file, evals │
│   5. Agent updates results.tsv, keeps/reverts  │
│   6. Check stop conditions (iter/cost/plateau) │
│   7. Loop                                       │
└─────────────────────────────────────────────────┘
```
The agent handles everything: reading files, editing code, running eval, committing, updating results.tsv, and reverting on failure. The orchestrator just loops, switches prompts on plateau, and enforces stop conditions.
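That thin control flow can be sketched directly. The following is illustrative only: the agent spawn is stubbed with an `echo`, and `max_iterations` is hard-coded instead of being read from `fleet.json`.

```shell
#!/usr/bin/env bash
# Skeleton of the orchestrator's outer loop: honor the pause sentinel,
# enforce the iteration cap, spawn the agent once per iteration.
max_iterations=3
iter=0
while [ "$iter" -lt "$max_iterations" ]; do
  if [ -e .paused ]; then
    echo "paused at iteration boundary"
    break
  fi
  iter=$((iter + 1))
  # The real script spawns the agent here (claude -p or codex exec),
  # choosing the normal or search-augmented prompt based on plateau state.
  echo "iteration $iter"
done
echo "done after $iter iterations"
```

The real orchestrator additionally tallies cost per iteration and checks it against `cost_cap_usd` before looping.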
Directory structure
```
$FLEET_ROOT/              # The problem directory
  fleet.json              # Fleet config
  program.md              # Agent instructions (you write this)
  eval.py                 # Immutable eval harness (you write this)
  solution.py             # Mutable file (agent edits this)
  results.tsv             # Experiment log (agent updates, git-untracked)
  orchestrator.sh         # Generated by launch.sh
  .orch-state.json        # Iteration state
  .paused                 # Sentinel (touch to pause)
  logs/
    session-iter-1.jsonl  # Per-iteration session logs
    session-iter-2.jsonl
    ...
```
fleet.json schema
```json
{
  "fleet_name": "optimize-api-latency",
  "type": "autoresearch",
  "config": {
    "model": "sonnet",
    "fallback_model": "haiku",
    "provider": "claude",
    "budget_per_iter": 2.00,
    "max_turns": 0
  },
  "problem": {
    "workdir": "/home/user/my-project",
    "eval_command": "make benchmark",
    "metric_regex": "^p99_latency_ms:\\s*([0-9.]+)",
    "metric_direction": "minimize",
    "results_file": "results.tsv",
    "program_md": "program.md"
  },
  "stop_when": {
    "max_iterations": 30,
    "cost_cap_usd": 15.0
  },
  "search": {
    "enabled": true,
    "plateau_threshold": 3
  }
}
```
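The `metric_regex` above is applied to the eval command's output, with its single capture group pulling out the number. A sketch of the extraction, using `sed -E` as a stand-in for whatever matcher the orchestrator actually uses (the sample output is made up):

```shell
# Hypothetical eval output with the metric buried among other lines.
eval_output='warmup complete
p99_latency_ms: 42.7
requests: 1000'

# Print only the capture group from lines matching the regex.
metric=$(printf '%s\n' "$eval_output" \
  | sed -nE 's/^p99_latency_ms:[[:space:]]*([0-9.]+).*/\1/p' | head -n1)
echo "$metric"   # prints 42.7
```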
Config fields
| Field | Default | Description |
|---|---|---|
| `config.model` | `sonnet` | Agent model |
| `config.fallback_model` | `haiku` | Fallback model (must differ from `model`) |
| `config.provider` | `claude` | `claude` or `codex` |
| `config.budget_per_iter` | `1.00` | Max USD per iteration |
| `config.max_turns` | `0` | Max agent turns (0 = unlimited) |
| `problem.workdir` | fleet root | Working directory — the repo/dir the agent operates in. Fleet root stores config + logs only. |
| `problem.eval_command` | required | Command to run evaluation (`python3 eval.py`, `make benchmark`, `pytest --tb=short`, etc.) |
| `problem.metric_regex` | (optional) | Regex to extract metric from eval output. Must have one capture group. Omit if eval prints a single number. |
| `problem.metric_direction` | `minimize` | `minimize` or `maximize` |
| `problem.results_file` | `results.tsv` | TSV log file (in workdir) |
| `problem.program_md` | `program.md` | Agent instructions file (checked in workdir first, then fleet root) |
| `stop_when.max_iterations` | `50` | Hard iteration limit |
| `stop_when.cost_cap_usd` | `0` | Total cost limit (0 = no limit) |
| `search.enabled` | `true` | Enable plateau-triggered web search |
| `search.plateau_threshold` | `3` | Consecutive discards before search |
Required inputs
You need three things in a fleet root directory:

- `fleet.json` — points `problem.workdir` at the target repo, sets `eval_command`
- `program.md` — agent instructions. Must say NEVER STOP.
- An eval command — anything that outputs a metric
```
fleet-root/          ← fleet.json + program.md + logs
  fleet.json
  program.md
  logs/              ← created automatically
your-repo/           ← workdir (agent operates here)
  src/...
  results.tsv        ← created automatically
```
When the user doesn't specify a benchmark
If the user gives you a repo and a goal but no eval command:
- Check for existing benchmarks: look for `Makefile` targets (`make benchmark`, `make perf`), `package.json` scripts (`npm run bench`, `yarn test`), `pytest` markers (`pytest -m benchmark`), or `bench/` directories.
- If found: use it as `eval_command`. Set `metric_regex` if it doesn't print a single number.
- If not found: write a benchmark script (`bench.sh` or `bench.py`) in the workdir that:
  - Runs the relevant operation (API call, function invocation, build, test suite)
  - Measures the metric the user cares about (latency, pass rate, bundle size, etc.)
  - Prints a single number to stdout
  - Exits 0 on success, non-zero on crash
- Set `eval_command` to run this script.
Example: user says "optimize API latency in my Express app":
```bash
#!/usr/bin/env bash
# bench.sh — measure p99 latency over 100 requests
npm start &>/dev/null &
PID=$!
sleep 3
RESULT=$(for _ in $(seq 1 100); do
  curl -s -o /dev/null -w '%{time_total}\n' http://localhost:3000/api/health
done | sort -n | sed -n '99p')
kill "$PID" 2>/dev/null
echo "$RESULT"
```
Then set "eval_command": "bash bench.sh" in fleet.json.
Setup
- Create fleet root with `fleet.json` + `program.md` (+ `bench.sh` if you wrote one)
- `bash ${CLAUDE_SKILL_DIR}/scripts/launch.sh <fleet-root>` (git init + results.tsv auto-created)
- `bash ${CLAUDE_SKILL_DIR}/scripts/status.sh <fleet-root>` to monitor
Available scripts
| Script | Purpose |
|---|---|
| `launch.sh <fleet-root> [--dry-run]` | Generate orchestrator.sh, spawn in tmux with monitor |
| `status.sh <fleet-root> [--watch]` | Show iteration, best metric, results.tsv, cost, plateau |
| `view.sh <fleet-root> <iter\|latest> [--follow]` | View parsed session events for a specific iteration |
| `report.sh <fleet-root> [--output file.md]` | Generate markdown summary after run completes |
| `pause.sh <fleet-root>` | Pause at next iteration boundary |
| `resume.sh <fleet-root>` | Resume paused fleet |
| `kill.sh <fleet-root>` | Hard stop: kill tmux, sweep orphans |
program.md template
Your program.md should follow this structure (adapt to your problem):
```markdown
# autoresearch: <problem description>

## Setup
1. Explore the codebase to understand the architecture.
2. Read `results.tsv` for prior experiment history.
3. Run `<eval_command>` to establish a baseline.

## Rules
- Goal: <minimize|maximize> the metric.
- Make ONE change per experiment. Keep changes focused.
- <any constraints: don't touch tests, don't modify config, etc.>

## The experiment loop
LOOP FOREVER:
1. Read results.tsv for context on what's been tried.
2. Make ONE change to the codebase.
3. `git add -A && git commit -m "short description"`
4. Run: `<eval_command>`
5. Record in results.tsv (tab-separated): `commit metric status description`
6. If metric improved: keep the commit.
7. If worse or crash: `git reset --hard HEAD~1` and log as discard/crash.
8. Go to step 1.

**NEVER STOP.** Run until manually interrupted.
```
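Step 6's "improved" check requires floating-point comparison, which bash arithmetic cannot do. A minimal sketch using awk, assuming a minimizing metric (the values and description are made up):

```shell
best=9.10   # best metric so far, taken from results.tsv
new=8.73    # metric from this experiment's eval run

# awk exits 0 (success) when the new metric beats the best one.
if awk -v n="$new" -v b="$best" 'BEGIN { exit !(n < b) }'; then
  decision=keep
else
  decision=discard   # followed by: git reset --hard HEAD~1
fi
echo "$decision"   # prints keep
```

For a maximizing metric the comparison flips to `n > b`; everything else stays the same.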
Key design principles (from Karpathy)
- Git as state machine — improvement = advance branch; regression = reset
- Fixed eval — makes all experiments comparable
- results.tsv as shared memory — agent reads history to avoid repeating failures
- NEVER STOP — agent runs autonomously until killed
- Simplicity criterion — a small gain with ugly complexity is not worth it
Open-world search (the extension)
When `search.enabled` is true, the orchestrator counts trailing discards in `results.tsv`. If the count reaches `search.plateau_threshold`, the next iteration gets a search-augmented prompt telling the agent to use WebSearch before coding.
This is validated: in experiment 009, search found Winograd's Strassen variant (15 additions vs 18) — a technique not in the LLM's training data — breaking through a plateau where vanilla autoresearch was stuck.
Critical: plateau detection is done in bash (deterministic), not by the LLM. The agent miscounted consecutive discards in early experiments, hallucinating plateaus. External counting is mandatory.
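A sketch of that external count, assuming the status column is the third tab-separated field of results.tsv (the exact column layout is whatever your program.md prescribes):

```shell
# Build a sample results.tsv whose last three rows are discards.
tsv=$(mktemp)
printf 'abc1\t9.10\tkeep\tbaseline\n'  >> "$tsv"
printf 'abc2\t9.40\tdiscard\ttry A\n'  >> "$tsv"
printf 'abc3\t9.55\tdiscard\ttry B\n'  >> "$tsv"
printf 'abc4\t9.20\tdiscard\ttry C\n'  >> "$tsv"

# Count trailing (most recent, consecutive) discards deterministically.
trailing=$(awk -F'\t' '{ s[NR] = $3 }
  END { n = 0; for (i = NR; i >= 1 && s[i] == "discard"; i--) n++; print n }' "$tsv")
echo "$trailing"   # prints 3: at the default threshold, so next iteration searches
rm -f "$tsv"
```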
Rationalizations to reject
| Agent says | Rebuttal |
|---|---|
| "The agent should search every iteration for best results" | Search-on-plateau beats always-search. Most early searches are redundant and add latency. Only search when stuck (3+ discards). |
| "I should manage git from the orchestrator" | The agent handles git. It can fix commit messages, handle edge cases, and revert intelligently. The orchestrator just loops. |
| "The eval harness can reuse the same inputs" | Reusing inputs is gameable. The agent will discover identity-based memoization and optimize for the benchmark, not the problem. Use fresh seeded inputs per timed run. |
| "I should use iterative-fleet for this" | Iterative-fleet has a reviewer. Autoresearch has no reviewer — the eval script IS the quality gate. Different pattern, different skill. |
$ARGUMENTS