Autoresearch Fleet

Autonomous research loop inspired by karpathy/autoresearch. One mutable file, one immutable eval harness, git as state machine, and a "NEVER STOP" directive. The agent modifies code, evaluates the result, keeps improvements, discards regressions, and repeats indefinitely.

Open-world extension: when the agent plateaus (N consecutive discards), the orchestrator injects a web-search prompt, breaking through knowledge ceilings the LLM can't cross alone.

When to use

  • Optimizing a single metric (latency, accuracy, loss, score)
  • The problem has a fast, deterministic eval harness
  • You want autonomous overnight runs (100+ experiments while you sleep)
  • The search space is too large for manual exploration

How it works

┌──────────────────────────────────────────────────┐
│                 orchestrator.sh                  │
│                                                  │
│  for each iteration:                             │
│    1. Count trailing discards in results.tsv     │
│    2. If >= plateau_threshold → search prompt    │
│    3. Spawn agent (claude -p or codex exec)      │
│    4. Agent reads program.md, edits file, evals  │
│    5. Agent updates results.tsv, keeps/reverts   │
│    6. Check stop conditions (iter/cost/plateau)  │
│    7. Loop                                       │
└──────────────────────────────────────────────────┘

The agent handles everything: reading files, editing code, running eval, committing, updating results.tsv, and reverting on failure. The orchestrator just loops, switches prompts on plateau, and enforces stop conditions.
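The loop above can be sketched in shell. This is illustrative only: the real `orchestrator.sh` is generated by `launch.sh` and adds cost tracking, logging, and provider handling; `AGENT_CMD`, `NORMAL_PROMPT`, and `SEARCH_PROMPT` are hypothetical placeholders.

```shell
# Illustrative orchestrator loop (the generated orchestrator.sh is more involved).
# AGENT_CMD, NORMAL_PROMPT, and SEARCH_PROMPT are hypothetical placeholders.
run_orchestrator() {
  local max_iter=$1 plateau_threshold=$2 iter=0
  while [ "$iter" -lt "$max_iter" ]; do
    if [ -e .paused ]; then sleep 30; continue; fi   # pause sentinel, no iteration consumed
    iter=$((iter + 1))
    # Count trailing discard/crash rows (status is column 3 of results.tsv)
    local discards
    discards=$(awk -F'\t' '{ if ($3=="discard" || $3=="crash") n++; else n=0 } END { print n+0 }' results.tsv)
    if [ "$discards" -ge "$plateau_threshold" ]; then
      "$AGENT_CMD" "$SEARCH_PROMPT"                  # plateau → search-augmented prompt
    else
      "$AGENT_CMD" "$NORMAL_PROMPT"
    fi
  done
}
```

The agent invocation does all the real work; the shell layer only decides which prompt to hand it and when to stop.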

Directory structure

$FLEET_ROOT/                    # The problem directory
  fleet.json                    # Fleet config
  program.md                    # Agent instructions (you write this)
  eval.py                       # Immutable eval harness (you write this)
  solution.py                   # Mutable file (agent edits this)
  results.tsv                   # Experiment log (agent updates, git-untracked)
  orchestrator.sh               # Generated by launch.sh
  .orch-state.json              # Iteration state
  .paused                       # Sentinel (touch to pause)
  logs/
    session-iter-1.jsonl         # Per-iteration session logs
    session-iter-2.jsonl
    ...

fleet.json schema

{
  "fleet_name": "optimize-api-latency",
  "type": "autoresearch",
  "config": {
    "model": "sonnet",
    "fallback_model": "haiku",
    "provider": "claude",
    "budget_per_iter": 2.00,
    "max_turns": 0
  },
  "problem": {
    "workdir": "/home/user/my-project",
    "eval_command": "make benchmark",
    "metric_regex": "^p99_latency_ms:\\s*([0-9.]+)",
    "metric_direction": "minimize",
    "results_file": "results.tsv",
    "program_md": "program.md"
  },
  "stop_when": {
    "max_iterations": 30,
    "cost_cap_usd": 15.0
  },
  "search": {
    "enabled": true,
    "plateau_threshold": 3
  }
}

Config fields

| Field | Default | Description |
| --- | --- | --- |
| `config.model` | `sonnet` | Agent model |
| `config.fallback_model` | `haiku` | Fallback model (must differ from `model`) |
| `config.provider` | `claude` | `claude` or `codex` |
| `config.budget_per_iter` | `1.00` | Max USD per iteration |
| `config.max_turns` | `0` | Max agent turns (0 = unlimited) |
| `problem.workdir` | fleet root | Working directory — the repo/dir the agent operates in. Fleet root stores config + logs only. |
| `problem.eval_command` | required | Command to run evaluation (`python3 eval.py`, `make benchmark`, `pytest --tb=short`, etc.) |
| `problem.metric_regex` | (optional) | Regex to extract the metric from eval output. Must have one capture group. Omit if eval prints a single number. |
| `problem.metric_direction` | `minimize` | `minimize` or `maximize` |
| `problem.results_file` | `results.tsv` | TSV log file (in workdir) |
| `problem.program_md` | `program.md` | Agent instructions file (checked in workdir first, then fleet root) |
| `stop_when.max_iterations` | `50` | Hard iteration limit |
| `stop_when.cost_cap_usd` | `0` | Total cost limit (0 = no limit) |
| `search.enabled` | `true` | Enable plateau-triggered web search |
| `search.plateau_threshold` | `3` | Consecutive discards before search |
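For instance, the `metric_regex` from the schema above can be exercised against sample eval output (values hypothetical); the single capture group becomes the recorded metric:

```shell
# Hypothetical eval output; metric_regex is "^p99_latency_ms:\s*([0-9.]+)"
OUT='warmup done
p99_latency_ms: 42.7'
# sed -nE prints only the capture group from the matching line
METRIC=$(printf '%s\n' "$OUT" | sed -nE 's/^p99_latency_ms:[[:space:]]*([0-9.]+).*/\1/p')
echo "$METRIC"   # prints 42.7
```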

Required inputs

You need 3 things in a fleet root directory:

  1. fleet.json — points problem.workdir at the target repo, sets eval_command
  2. program.md — agent instructions. Must say NEVER STOP.
  3. An eval command — anything that outputs a metric

Typical layout:

fleet-root/              ← fleet.json + program.md + logs
  fleet.json
  program.md
  logs/                  ← created automatically
your-repo/               ← workdir (agent operates here)
  src/...
  results.tsv            ← created automatically
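A minimal `fleet.json` matching this two-directory layout might look like the following (paths and the eval command are hypothetical; omitted fields fall back to the defaults in the config table above):

```json
{
  "fleet_name": "my-fleet",
  "type": "autoresearch",
  "problem": {
    "workdir": "/home/user/your-repo",
    "eval_command": "python3 eval.py",
    "metric_direction": "minimize"
  }
}
```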

When the user doesn't specify a benchmark

If the user gives you a repo and a goal but no eval command:

  1. Check for existing benchmarks: look for Makefile targets (make benchmark, make perf), package.json scripts (npm run bench, yarn test), pytest markers (pytest -m benchmark), or bench/ directories.
  2. If found: use it as eval_command. Set metric_regex if it doesn't print a single number.
  3. If not found: write a benchmark script (bench.sh or bench.py) in the workdir that:
    • Runs the relevant operation (API call, function invocation, build, test suite)
    • Measures the metric the user cares about (latency, pass rate, bundle size, etc.)
    • Prints a single number to stdout
    • Exits 0 on success, non-zero on crash
  4. Set eval_command to run this script.

Example: user says "optimize API latency in my Express app":

#!/usr/bin/env bash
# bench.sh — measure request latency (single request as a simple proxy;
# use a load tool or repeated requests for a true p99)
npm start &>/dev/null &
PID=$!
sleep 3
RESULT=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:3000/api/health)
kill "$PID" 2>/dev/null
echo "$RESULT"

Then set "eval_command": "bash bench.sh" in fleet.json.

Setup

  1. Create fleet root with fleet.json + program.md (+ bench.sh if you wrote one)
  2. bash ${CLAUDE_SKILL_DIR}/scripts/launch.sh <fleet-root> (git init + results.tsv auto-created)
  3. bash ${CLAUDE_SKILL_DIR}/scripts/status.sh <fleet-root> to monitor

Available scripts

| Script | Purpose |
| --- | --- |
| `launch.sh <fleet-root> [--dry-run]` | Generate orchestrator.sh, spawn in tmux with monitor |
| `status.sh <fleet-root> [--watch]` | Show iteration, best metric, results.tsv, cost, plateau |
| `view.sh <fleet-root> <iter\|latest> [--follow]` | View parsed session events for a specific iteration |
| `report.sh <fleet-root> [--output file.md]` | Generate markdown summary after run completes |
| `pause.sh <fleet-root>` | Pause at next iteration boundary |
| `resume.sh <fleet-root>` | Resume paused fleet |
| `kill.sh <fleet-root>` | Hard stop: kill tmux, sweep orphans |

program.md template

Your program.md should follow this structure (adapt to your problem):

# autoresearch: <problem description>

## Setup
1. Explore the codebase to understand the architecture.
2. Read `results.tsv` for prior experiment history.
3. Run `<eval_command>` to establish a baseline.

## Rules
- Goal: <minimize|maximize> the metric.
- Make ONE change per experiment. Keep changes focused.
- <any constraints: don't touch tests, don't modify config, etc.>

## The experiment loop
LOOP FOREVER:
1. Read results.tsv for context on what's been tried.
2. Make ONE change to the codebase.
3. `git add -A && git commit -m "short description"`
4. Run: `<eval_command>`
5. Record in results.tsv (tab-separated): `commit  metric  status  description`
6. If metric improved: keep the commit.
7. If worse or crash: `git reset --hard HEAD~1` and log as discard/crash.
8. Go to step 1.

**NEVER STOP.** Run until manually interrupted.
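Following that loop, `results.tsv` accumulates tab-separated rows like these (commit hashes, metric values, and descriptions are hypothetical):

```
a1f9c3d	184.2	keep	baseline measurement
b27e410	176.9	keep	cache DNS lookups in the client pool
c93b8a2	191.4	discard	batched requests regressed p99
d4410fe	crash	crash	async rewrite broke server startup
```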

Key design principles (from Karpathy)

  1. Git as state machine — improvement = advance branch; regression = reset
  2. Fixed eval — makes all experiments comparable
  3. results.tsv as shared memory — agent reads history to avoid repeating failures
  4. NEVER STOP — agent runs autonomously until killed
  5. Simplicity criterion — a small gain with ugly complexity is not worth it

Open-world search (the extension)

When search.enabled is true, the orchestrator counts trailing discards in results.tsv. If the count exceeds search.plateau_threshold, the next iteration gets a search-augmented prompt telling the agent to use WebSearch before coding.

This is validated: in experiment 009, search found Winograd's Strassen variant (15 additions vs 18) — a technique not in the LLM's training data — breaking through a plateau where vanilla autoresearch was stuck.

Critical: plateau detection is done in bash (deterministic), not by the LLM. The agent miscounted consecutive discards in early experiments, hallucinating plateaus. External counting is mandatory.
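A deterministic counter along these lines (assuming the status column is the third tab-separated field, per the program.md template) is what the orchestrator can rely on instead of the LLM's own count:

```shell
# Count trailing discard/crash rows in a results.tsv; any keep resets the streak.
count_trailing_discards() {
  awk -F'\t' '{ if ($3 == "discard" || $3 == "crash") n++; else n = 0 } END { print n + 0 }' "$1"
}
```

Because a keep anywhere resets the streak, only an unbroken run of failures at the end of the file can trigger the search prompt.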

Rationalizations to reject

| Agent says | Rebuttal |
| --- | --- |
| "The agent should search every iteration for best results" | Search-on-plateau beats always-search. Most early searches are redundant and add latency. Only search when stuck (3+ discards). |
| "I should manage git from the orchestrator" | The agent handles git. It can fix commit messages, handle edge cases, and revert intelligently. The orchestrator just loops. |
| "The eval harness can reuse the same inputs" | Reusing inputs is gameable. The agent will discover identity-based memoization and optimize for the benchmark, not the problem. Use fresh seeded inputs per timed run. |
| "I should use iterative-fleet for this" | Iterative-fleet has a reviewer. Autoresearch has no reviewer — the eval script IS the quality gate. Different pattern, different skill. |

