autoresearch-fleet
Autoresearch Fleet
Autonomous research loop inspired by karpathy/autoresearch. One mutable file, one immutable eval harness, git as state machine, and a "NEVER STOP" directive. The agent modifies code, evaluates the result, keeps improvements, discards regressions, and repeats indefinitely.
Open-world extension: when the agent plateaus (N consecutive discards), the orchestrator injects a web-search prompt, breaking through knowledge ceilings the LLM can't cross alone.
When to use
- Optimizing a single metric (latency, accuracy, loss, score)
- The problem has a fast, deterministic eval harness
- You want autonomous overnight runs (100+ experiments while you sleep)
- The search space is too large for manual exploration
How it works
```
┌─────────────────────────────────────────────────┐
│ orchestrator.sh                                 │
│                                                 │
│ for each iteration:                             │
│   1. Count trailing discards in results.tsv    │
│   2. If >= plateau_threshold → search prompt   │
│   3. Spawn agent (claude -p or codex exec)     │
│   4. Agent reads program.md, edits file, evals │
│   5. Agent updates results.tsv, keeps/reverts  │
│   6. Check stop conditions (iter/cost/plateau) │
│   7. Loop                                       │
└─────────────────────────────────────────────────┘
```
The agent handles everything: reading files, editing code, running eval, committing, updating results.tsv, and reverting on failure. The orchestrator just loops, switches prompts on plateau, and enforces stop conditions.
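That thin control flow can be sketched directly. The following is illustrative only: the agent spawn is stubbed with an `echo`, and `max_iterations` is hard-coded instead of being read from `fleet.json`.

```shell
#!/usr/bin/env bash
# Skeleton of the orchestrator's outer loop: honor the pause sentinel,
# enforce the iteration cap, spawn the agent once per iteration.
max_iterations=3
iter=0
while [ "$iter" -lt "$max_iterations" ]; do
  if [ -e .paused ]; then
    echo "paused at iteration boundary"
    break
  fi
  iter=$((iter + 1))
  # The real script spawns the agent here (claude -p or codex exec),
  # choosing the normal or search-augmented prompt based on plateau state.
  echo "iteration $iter"
done
echo "done after $iter iterations"
```

The real orchestrator additionally tallies cost per iteration and checks it against `cost_cap_usd` before looping.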
Directory structure
```
$FLEET_ROOT/              # The problem directory
  fleet.json              # Fleet config
  program.md              # Agent instructions (you write this)
  eval.py                 # Immutable eval harness (you write this)
  solution.py             # Mutable file (agent edits this)
  results.tsv             # Experiment log (agent updates, git-untracked)
  orchestrator.sh         # Generated by launch.sh
  .orch-state.json        # Iteration state
  .paused                 # Sentinel (touch to pause)
  logs/
    session-iter-1.jsonl  # Per-iteration session logs
    session-iter-2.jsonl
    ...
```
fleet.json schema
```json
{
  "fleet_name": "optimize-api-latency",
  "type": "autoresearch",
  "config": {
    "model": "sonnet",
    "fallback_model": "haiku",
    "provider": "claude",
    "budget_per_iter": 2.00,
    "max_turns": 0
  },
  "problem": {
    "workdir": "/home/user/my-project",
    "eval_command": "make benchmark",
    "metric_regex": "^p99_latency_ms:\\s*([0-9.]+)",
    "metric_direction": "minimize",
    "results_file": "results.tsv",
    "program_md": "program.md"
  },
  "stop_when": {
    "max_iterations": 30,
    "cost_cap_usd": 15.0
  },
  "search": {
    "enabled": true,
    "plateau_threshold": 3
  }
}
```
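The `metric_regex` above is applied to the eval command's output, with its single capture group pulling out the number. A sketch of the extraction, using `sed -E` as a stand-in for whatever matcher the orchestrator actually uses (the sample output is made up):

```shell
# Hypothetical eval output with the metric buried among other lines.
eval_output='warmup complete
p99_latency_ms: 42.7
requests: 1000'

# Print only the capture group from lines matching the regex.
metric=$(printf '%s\n' "$eval_output" \
  | sed -nE 's/^p99_latency_ms:[[:space:]]*([0-9.]+).*/\1/p' | head -n1)
echo "$metric"   # prints 42.7
```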
Config fields
| Field | Default | Description |
|---|---|---|
| `config.model` | `sonnet` | Agent model |
| `config.fallback_model` | `haiku` | Fallback model (must differ from `model`) |
| `config.provider` | `claude` | `claude` or `codex` |
| `config.budget_per_iter` | `1.00` | Max USD per iteration |
| `config.max_turns` | `0` | Max agent turns (0 = unlimited) |
| `problem.workdir` | fleet root | Working directory — the repo/dir the agent operates in. Fleet root stores config + logs only. |
| `problem.eval_command` | required | Command to run evaluation (`python3 eval.py`, `make benchmark`, `pytest --tb=short`, etc.) |
| `problem.metric_regex` | (optional) | Regex to extract metric from eval output. Must have one capture group. Omit if eval prints a single number. |
| `problem.metric_direction` | `minimize` | `minimize` or `maximize` |
| `problem.results_file` | `results.tsv` | TSV log file (in workdir) |
| `problem.program_md` | `program.md` | Agent instructions file (checked in workdir first, then fleet root) |
| `stop_when.max_iterations` | `50` | Hard iteration limit |
| `stop_when.cost_cap_usd` | `0` | Total cost limit (0 = no limit) |
| `search.enabled` | `true` | Enable plateau-triggered web search |
| `search.plateau_threshold` | `3` | Consecutive discards before search |
Required inputs
You need three things in a fleet root directory:

- `fleet.json` — points `problem.workdir` at the target repo, sets `eval_command`
- `program.md` — agent instructions. Must say NEVER STOP.
- An eval command — anything that outputs a metric
```
fleet-root/          ← fleet.json + program.md + logs
  fleet.json
  program.md
  logs/              ← created automatically
your-repo/           ← workdir (agent operates here)
  src/...
  results.tsv        ← created automatically
```
When the user doesn't specify a benchmark
If the user gives you a repo and a goal but no eval command:
- Check for existing benchmarks: look for `Makefile` targets (`make benchmark`, `make perf`), `package.json` scripts (`npm run bench`, `yarn test`), `pytest` markers (`pytest -m benchmark`), or `bench/` directories.
- If found: use it as `eval_command`. Set `metric_regex` if it doesn't print a single number.
- If not found: write a benchmark script (`bench.sh` or `bench.py`) in the workdir that:
  - Runs the relevant operation (API call, function invocation, build, test suite)
  - Measures the metric the user cares about (latency, pass rate, bundle size, etc.)
  - Prints a single number to stdout
  - Exits 0 on success, non-zero on crash
- Set `eval_command` to run this script.
Example: user says "optimize API latency in my Express app":
```bash
#!/usr/bin/env bash
# bench.sh — measure p99 latency over 100 requests
npm start &>/dev/null &
PID=$!
sleep 3
RESULT=$(for _ in $(seq 1 100); do
  curl -s -o /dev/null -w '%{time_total}\n' http://localhost:3000/api/health
done | sort -n | sed -n '99p')
kill "$PID" 2>/dev/null
echo "$RESULT"
```
Then set "eval_command": "bash bench.sh" in fleet.json.
Setup
- Create fleet root with `fleet.json` + `program.md` (+ `bench.sh` if you wrote one)
- `bash ${CLAUDE_SKILL_DIR}/scripts/launch.sh <fleet-root>` (git init + results.tsv auto-created)
- `bash ${CLAUDE_SKILL_DIR}/scripts/status.sh <fleet-root>` to monitor
Available scripts
| Script | Purpose |
|---|---|
| `launch.sh <fleet-root> [--dry-run]` | Generate orchestrator.sh, spawn in tmux with monitor |
| `status.sh <fleet-root> [--watch]` | Show iteration, best metric, results.tsv, cost, plateau |
| `view.sh <fleet-root> <iter\|latest> [--follow]` | View parsed session events for a specific iteration |
| `report.sh <fleet-root> [--output file.md]` | Generate markdown summary after run completes |
| `pause.sh <fleet-root>` | Pause at next iteration boundary |
| `resume.sh <fleet-root>` | Resume paused fleet |
| `kill.sh <fleet-root>` | Hard stop: kill tmux, sweep orphans |
program.md template
Your program.md should follow this structure (adapt to your problem):
```markdown
# autoresearch: <problem description>

## Setup
1. Explore the codebase to understand the architecture.
2. Read `results.tsv` for prior experiment history.
3. Run `<eval_command>` to establish a baseline.

## Rules
- Goal: <minimize|maximize> the metric.
- Make ONE change per experiment. Keep changes focused.
- <any constraints: don't touch tests, don't modify config, etc.>

## The experiment loop
LOOP FOREVER:
1. Read results.tsv for context on what's been tried.
2. Make ONE change to the codebase.
3. `git add -A && git commit -m "short description"`
4. Run: `<eval_command>`
5. Record in results.tsv (tab-separated): `commit metric status description`
6. If metric improved: keep the commit.
7. If worse or crash: `git reset --hard HEAD~1` and log as discard/crash.
8. Go to step 1.

**NEVER STOP.** Run until manually interrupted.
```
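Step 6's "improved" check requires floating-point comparison, which bash arithmetic cannot do. A minimal sketch using awk, assuming a minimizing metric (the values and description are made up):

```shell
best=9.10   # best metric so far, taken from results.tsv
new=8.73    # metric from this experiment's eval run

# awk exits 0 (success) when the new metric beats the best one.
if awk -v n="$new" -v b="$best" 'BEGIN { exit !(n < b) }'; then
  decision=keep
else
  decision=discard   # followed by: git reset --hard HEAD~1
fi
echo "$decision"   # prints keep
```

For a maximizing metric the comparison flips to `n > b`; everything else stays the same.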
Key design principles (from Karpathy)
- Git as state machine — improvement = advance branch; regression = reset
- Fixed eval — makes all experiments comparable
- results.tsv as shared memory — agent reads history to avoid repeating failures
- NEVER STOP — agent runs autonomously until killed
- Simplicity criterion — a small gain with ugly complexity is not worth it
Open-world search (the extension)
When `search.enabled` is true, the orchestrator counts trailing discards in `results.tsv`. If the count reaches `search.plateau_threshold`, the next iteration gets a search-augmented prompt telling the agent to use WebSearch before coding.
This is validated: in experiment 009, search found Winograd's Strassen variant (15 additions vs 18) — a technique not in the LLM's training data — breaking through a plateau where vanilla autoresearch was stuck.
Critical: plateau detection is done in bash (deterministic), not by the LLM. The agent miscounted consecutive discards in early experiments, hallucinating plateaus. External counting is mandatory.
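A sketch of that external count, assuming the status column is the third tab-separated field of results.tsv (the exact column layout is whatever your program.md prescribes):

```shell
# Build a sample results.tsv whose last three rows are discards.
tsv=$(mktemp)
printf 'abc1\t9.10\tkeep\tbaseline\n'  >> "$tsv"
printf 'abc2\t9.40\tdiscard\ttry A\n'  >> "$tsv"
printf 'abc3\t9.55\tdiscard\ttry B\n'  >> "$tsv"
printf 'abc4\t9.20\tdiscard\ttry C\n'  >> "$tsv"

# Count trailing (most recent, consecutive) discards deterministically.
trailing=$(awk -F'\t' '{ s[NR] = $3 }
  END { n = 0; for (i = NR; i >= 1 && s[i] == "discard"; i--) n++; print n }' "$tsv")
echo "$trailing"   # prints 3: at the default threshold, so next iteration searches
rm -f "$tsv"
```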
Rationalizations to reject
| Agent says | Rebuttal |
|---|---|
| "The agent should search every iteration for best results" | Search-on-plateau beats always-search. Most early searches are redundant and add latency. Only search when stuck (3+ discards). |
| "I should manage git from the orchestrator" | The agent handles git. It can fix commit messages, handle edge cases, and revert intelligently. The orchestrator just loops. |
| "The eval harness can reuse the same inputs" | Reusing inputs is gameable. The agent will discover identity-based memoization and optimize for the benchmark, not the problem. Use fresh seeded inputs per timed run. |
| "I should use iterative-fleet for this" | Iterative-fleet has a reviewer. Autoresearch has no reviewer — the eval script IS the quality gate. Different pattern, different skill. |
$ARGUMENTS