Autoresearch — Autonomous Iterative Optimization
You define the goal and how to measure it. The agent does the rest: hypothesize, edit, measure, keep or revert — running autonomously until you stop it or the budget runs out.
Inspired by Karpathy's autoresearch, adapted for Python, SQL, pytest, ty, and ML workflows.
[!IMPORTANT] Every experiment is committed before running, and reverted on failure. The branch only ever advances on real improvements.
Phase 1 — Setup (interactive)
Work through these questions with the user before touching any code. Do not skip or assume answers.
1.1 Goal
Ask:
What are you trying to improve?
Examples: execution time, memory usage, pytest pass rate, ty error count, SQL query latency, model accuracy, training throughput, bundle size.
1.2 Metric command
Ask:
What command produces the metric, and how do I read the number from its output?
- Command — the exact shell command to run
- Extraction — regex, line number, JSON path, or description of what to parse
- Direction — lower is better, or higher is better?
Refer to the domain quick-reference below for ready-made commands if the user is unsure.
Record:
- `METRIC_COMMAND`
- `METRIC_EXTRACTION`
- `METRIC_DIRECTION` (`lower_is_better` | `higher_is_better`)
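For example, a filled-in pytest pass-rate spec might look like this (hypothetical values — adapt to the user's answers):

```bash
# Hypothetical example: pytest pass rate (higher is better)
METRIC_COMMAND='pytest --tb=no -q'
# The summary line looks like "412 passed, 3 failed in 12.34s"
METRIC_EXTRACTION='grep -Eo "[0-9]+ passed" | grep -Eo "[0-9]+"'   # number of passing tests
METRIC_DIRECTION=higher_is_better
```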
1.3 Scope
Ask:
Which files or directories may I edit? Which are off-limits?
Record:
- `IN_SCOPE` — files/dirs the agent may modify
- `OUT_OF_SCOPE` — must not be touched
1.4 Constraints
Ask:
Any constraints I should respect?
Examples: no new dependencies, must keep existing tests green, public API must stay stable, max 2 min per run, must stay type-clean (ty 0 errors), VRAM budget, complexity budget.
Record as CONSTRAINTS.
1.4b Web search
Ask:
May I search the web for optimization ideas, documentation, or techniques?
Web search lets me look up library docs, algorithm papers, Stack Overflow answers, and benchmarking guides to generate better hypotheses — especially useful for ML tuning, SQL optimization, and unfamiliar libraries.
Options:
`yes` (search freely), `ask` (propose each query before running), `no` (stay offline, codebase only).
Record as WEB_SEARCH (yes | ask | no, default no).
1.5 Budget
Ask:
How many experiments, or keep going until interrupted?
Record as MAX_EXPERIMENTS (number or unlimited).
1.6 Simplicity policy
State the default and ask for adjustments:
Default: simpler beats marginally faster. Removing code while holding or improving the metric is a win. Complexity has a cost — weigh it honestly against the gain. OK to proceed with this policy, or do you want to adjust it?
Record any adjustment as SIMPLICITY_POLICY.
1.7 Confirm
Present a summary table and wait for explicit confirmation before continuing.
| Parameter | Value |
|---|---|
| Goal | |
| Metric command | |
| Metric extraction | |
| Direction | |
| In-scope | |
| Out-of-scope | |
| Constraints | |
| Max experiments | |
| Simplicity policy | |
| Web search | |
Parallel autoresearch
Run multiple experiment loops simultaneously, each on its own branch and working
directory, using git worktree. Useful when you want to race two hypotheses, explore
independent scopes concurrently, or keep your main editor on main while experiments run.
When to use parallel mode
- Two or more independent scopes (different files or modules) that won't conflict
- Long-running metric commands where waiting for one loop blocks progress
- Competing strategies you want to benchmark side-by-side
Setup
For each parallel run, create a dedicated worktree before starting its loop:
# In the repo root — repeat for each parallel run
RUN_ID_A="<goal-a>" # e.g. etl-runtime
RUN_ID_B="<goal-b>" # e.g. ty-errors
git worktree add ../<repo>-${RUN_ID_A} -b autoresearch/${RUN_ID_A}
git worktree add ../<repo>-${RUN_ID_B} -b autoresearch/${RUN_ID_B}
Each worktree is a sibling directory with its own HEAD and index — commits,
resets, and file edits in one worktree are completely invisible to the others.
Guidelines
| Rule | Reason |
|---|---|
| Non-overlapping `IN_SCOPE` | Two loops editing the same file will overwrite each other's changes — the merge conflict is yours to resolve |
| Each run gets its own worktree | Never share a directory between two loops |
| Separate `RESULTS_FILE` per run | Use the unique `RUN_ID` — they naturally don't collide |
| Merge order matters | When both runs finish, merge the one with the larger improvement first; re-run the other's baseline before merging it to get an honest combined measurement (see the sketch below) |
| VS Code workspace | File → Add Folder to Workspace for each worktree directory — all runs visible side-by-side |
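A minimal sketch of that merge flow, assuming run A delivered the larger improvement (branch and directory names follow the setup above):

```bash
# Merge the bigger win first
git checkout <base-branch>
git merge autoresearch/${RUN_ID_A}

# Re-baseline run B on top of the merged result before deciding to merge it
cd ../<repo>-${RUN_ID_B}
git rebase <base-branch>    # replay B's kept experiments onto the updated base
# Re-run METRIC_COMMAND here and compare against the post-merge baseline;
# only merge autoresearch/${RUN_ID_B} if the combined measurement still improves.
```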
Cleanup
When a parallel run finishes, remove its worktree:
git worktree remove ../<repo>-${RUN_ID_A} # removes directory + unregisters
# or, if the run produced no improvement:
git worktree remove --force ../<repo>-${RUN_ID_A}
git branch -D autoresearch/${RUN_ID_A}
List active worktrees at any time:
git worktree list
Phase 2 — Branch & baseline
1. Generate a run ID — a short goal slug, e.g. `pytest-passrate`. This ID is used for all artifacts of this run so multiple runs never collide.

   ```bash
   RUN_ID="<goal-slug>"                        # e.g. pytest-passrate
   RESULTS_FILE="autoresearch-${RUN_ID}.tsv"
   LOG_FILE="autoresearch-${RUN_ID}.log"
   ```

2. Create a branch — propose `autoresearch/<run-id>`, create it:

   ```bash
   git checkout -b autoresearch/${RUN_ID}
   ```

3. Read in-scope files — build full context before making any changes.

4. Initialize `$RESULTS_FILE` in the repo root with the header row:

   ```
   experiment	commit	metric	status	description
   ```

   Register both files in `.git/info/exclude` (append only — never modify tracked files):

   ```bash
   echo "${RESULTS_FILE}" >> .git/info/exclude
   echo "${LOG_FILE}" >> .git/info/exclude
   ```

5. Run the baseline — execute `METRIC_COMMAND`, extract the value, record it as experiment `0` with status `baseline` (see the sketch after this list).

6. Report to the user:

   > Baseline: [metric] = [value]. Run ID: `<run-id>`. Starting experiment loop.
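A minimal sketch of the baseline step, assuming the metric is the last number in the log (swap in the agreed `METRIC_EXTRACTION`):

```bash
# Hypothetical baseline measurement and first TSV row — adapt the extraction line
eval "${METRIC_COMMAND}" > "${LOG_FILE}" 2>&1
BASELINE=$(grep -Eo '[0-9]+(\.[0-9]+)?' "${LOG_FILE}" | tail -1)   # example extraction only
printf '0\t%s\t%s\tbaseline\tunmodified code\n' \
  "$(git rev-parse --short HEAD)" "${BASELINE}" >> "${RESULTS_FILE}"
```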
Phase 3 — Experiment loop
Run continuously. Never pause to ask "should I continue?". Stop only when:
- `MAX_EXPERIMENTS` is reached, or
- the user interrupts
Each iteration
QUESTION (Questioner) Profile the current state before hypothesizing.
Ask structured questions about the code/data to surface bottlenecks:
1. What changed since last iteration? (diff awareness)
2. Where is time/resource actually spent? (profile or re-read metrics)
3. What patterns in prior results suggest a direction?
- 2+ consecutive keeps in same area → probe deeper
- 3+ discards → pivot strategy
- crash → structural issue, not parameter issue
4. Are there external signals to incorporate?
(error messages, profiler output, log patterns)
Record answers as QUESTIONER_NOTES for this iteration.
THINK Synthesize QUESTIONER_NOTES into a hypothesis.
Use domain reasoning vocabulary (see below) to sharpen the hypothesis.
Form: "X should improve Y because Z."
If WEB_SEARCH is `yes` or `ask` and you are stuck or entering a new
domain (e.g. unfamiliar library, ML algorithm, SQL planner behavior),
search the web for relevant techniques, docs, or benchmarks before
forming the hypothesis. For `ask`, state the proposed query and wait
for confirmation. Log the source URL in the description column of the
TSV when a web result directly inspired the experiment.
Follow experiment strategy priority below.
SCORE Rate the hypothesis before acting (1–10 each):
- Impact: how much metric improvement expected?
- Feasibility: how likely to work without breaking things?
- Novelty: how different from what was already tried? (check $RESULTS_FILE)
Average ≥ 5 → proceed. Below 5 → generate a better hypothesis.
Skip SCORE on experiment #1 (no prior data to compare against).
REFLECT Self-check before editing:
- What assumption am I making that could be wrong?
- Has something similar already been tried and failed? (scan $RESULTS_FILE)
- Am I stuck in a local optimum? (3+ keeps in same area → try different axis)
- Could this change make things worse in a way I won't measure?
If reflection reveals a flaw → revise hypothesis and re-SCORE.
EDIT Make one focused change to in-scope files.
Keep it minimal — one idea per experiment.
COMMIT Stage only in-scope files, then commit:
git add <IN_SCOPE files> && git commit
Message format: "experiment: <short description>"
RUN Execute METRIC_COMMAND, redirect all output:
<command> > $LOG_FILE 2>&1
MEASURE Extract the metric from $LOG_FILE.
On failure: read the last 50 lines of $LOG_FILE for the error.
INSPECT (Inspector) Validate the experiment beyond just the metric.
Checklist — all must pass for a "keep" decision:
☐ Metric improved (or held, if simplification pass)
☐ Change is minimal and focused (one idea, not a kitchen sink)
☐ No unrelated regressions introduced (test suite, type checker)
☐ Code complexity did not increase disproportionately to the gain
☐ The change is understandable — would a reviewer accept it?
☐ No hardcoded magic values that only work for the benchmark
☐ Description matches actual change (no hallucinated improvements)
☐ Metric improvement is real, not positive spin on noise
If any check fails, downgrade to "discard" even if metric improved.
Record inspector verdict as INSPECTOR_NOTES.
DECIDE Compare to current best (incorporating INSPECTOR_NOTES):
✅ IMPROVED → keep commit, update best, log status = "keep"
❌ SAME/WORSE → revert only in-scope files:
git reset HEAD~1              # undo the commit, keep the working tree
git restore <IN_SCOPE files> # discard changes to in-scope files only
log status = "discard"
⚠️ METRIC UP BUT INSPECTOR FAIL → revert, log status = "inspector-reject"
Record why: "metric +12% but added 40 LOC of unmaintainable code"
💥 CRASH → attempt up to 2 quick fixes (typo, import, simple error),
amend commit, re-run. If still broken, undo the commit and
restore only in-scope files; log status = "crash".
LOG Append to $RESULTS_FILE:
<N> <commit> <value> <status> <description>
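One way to append that row (a sketch; N, EXPERIMENT_COMMIT, VALUE, STATUS, and DESCRIPTION are hypothetical variables holding this iteration's data — capture the commit hash right after the COMMIT step, since a discarded commit is gone from HEAD by LOG time):

```bash
# Append one tab-separated experiment row to the run's results file
printf '%s\t%s\t%s\t%s\t%s\n' \
  "${N}" "${EXPERIMENT_COMMIT}" "${VALUE}" "${STATUS}" "${DESCRIPTION}" \
  >> "${RESULTS_FILE}"
```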
Domain reasoning vocabulary
Insert these domain-specific keywords into your THINK step to sharpen hypothesis quality. Using precise terminology activates better reasoning patterns.
| Domain | Keywords to use in hypotheses |
|---|---|
| Python perf | bottleneck, hot path, cache locality, allocation pressure, GIL contention, vectorize, amortize |
| SQL perf | cardinality, selectivity, partition pruning, predicate pushdown, shuffle, skew, broadcast threshold |
| ML training | learning rate schedule, gradient norm, batch size saturation, overfitting signal, regularization strength |
| Test coverage | uncovered branch, edge case, boundary condition, error path, mock boundary |
| Memory | peak allocation, fragmentation, object lifetime, reference cycle, weak reference |
| General | invariant, precondition, tight loop, early exit, short-circuit, amortized cost |
Experiment strategy
Follow this priority order:
- Low-hanging fruit — obvious inefficiencies, trivial parameter changes
- Follow promising directions — if something worked, probe further
- Diversify after plateaus — 3–5 consecutive failures → switch strategy
- Combine winners — if A and B each improved independently, try A+B
- Simplification passes — periodically try removing code; hold the metric
- Bigger changes — algorithmic or architectural changes after incremental ideas run dry
Constraint enforcement
- Time budget: if a run exceeds 2× baseline duration, kill and treat as crash
- Test integrity: if constraints require green tests, run them after each experiment; revert if they break, even if the primary metric improved
- ty/type safety: if type-cleanliness is a constraint, run `uv run ty check` after each change and revert if new errors appear (see the sketch below)
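A hedged sketch of those per-experiment guards — run only the checks the user's CONSTRAINTS actually require:

```bash
# Secondary checks after each experiment; any failure forces a revert,
# even if the primary metric improved
CONSTRAINT_FAILED=0
pytest --tb=no -q   || CONSTRAINT_FAILED=1   # only if green tests are required
uv run ty check     || CONSTRAINT_FAILED=1   # only if type-cleanliness (0 errors) is required
if [ "${CONSTRAINT_FAILED}" = "1" ]; then
  git reset HEAD~1                           # same revert as a discarded experiment
  git restore <IN_SCOPE files>
fi
```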
Phase 4 — Report, cleanup & next steps
When the loop ends (budget reached or interrupted), work through all four sub-phases.
4.1 Results report
1. Print `$RESULTS_FILE` as a formatted table (one option is sketched below).
2. Summarize:
   - Total experiments / kept / discarded / crashed
   - Baseline → final metric, improvement %
   - Top 3 most impactful changes (by metric delta)
3. Show the git log of kept experiments:

   ```bash
   git log --oneline <start-commit>..HEAD
   ```
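One way to render the table straight from the TSV (a sketch using standard tools):

```bash
# Align the tab-separated results into readable columns
column -t -s $'\t' "${RESULTS_FILE}"
```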
4.1b Run-level review
After reporting results, perform a structured review of the entire optimization run. Rate the run on these axes (1–10):
| Axis | Question |
|---|---|
| Soundness | Are the metric improvements real and reproducible, or could they be measurement artifacts? |
| Quality | Is the final code better than baseline? Would a senior engineer approve the diff? |
| Significance | Is the improvement meaningful enough to justify the complexity introduced? |
| Completeness | Were the most promising directions explored, or did the loop get stuck early? |
Present scores and a 2–3 sentence summary. Flag:
- Experiments where metric improved but the change may be fragile
- Promising directions that were not explored (ideas scored but never tried)
- Any sign of overfitting to the specific benchmark input
- Inspector-rejected experiments that might be worth revisiting with a different approach
If the run-level review reveals concerns, note them in the report before cleanup.
4.2 Cleanup
Remove run artifacts that are no longer needed. Ask the user once before deleting:
> Run complete. Clean up `$RESULTS_FILE` and `$LOG_FILE` from the working directory? (The kept experiments stay in the branch history; the TSV itself was never committed.)
If confirmed:
rm -f "${RESULTS_FILE}" "${LOG_FILE}"
Also remove the exclude entries added in Phase 2 so the file is left tidy:
# remove the two lines added during setup — fixed-string, whole-line match;
# `|| true` covers the case where nothing else is left in the exclude file
grep -vFx -e "${RESULTS_FILE}" -e "${LOG_FILE}" .git/info/exclude > /tmp/_exclude_tmp || true
mv /tmp/_exclude_tmp .git/info/exclude
4.3 Proposed next steps
Present these as a checklist. Mark which apply based on what the run actually changed.
Code quality
- Run `/deslop` on changed files — automated optimization often leaves mechanical patterns, naming inconsistencies, or removed comments that need a pass
- Run `ty` across the full project to confirm no new type errors leaked in
- Run the full test suite one final time on the current branch tip

Commit hygiene
- Squash experiment commits into one or a few logical commits before merging:

  ```bash
  git rebase -i <start-commit>
  ```

  Replace all `experiment:` commits with meaningful messages describing what changed and why it helped.
- Update any affected docstrings or inline comments that describe the old behavior
Integration
- Open a PR from `autoresearch/<run-id>` into your base branch
- Add a note in the PR description linking to the `$RESULTS_FILE` (or paste the summary table) so reviewers understand the methodology
Further experimentation
- Things not tried (ideas the loop skipped as too risky or complex):
  - algorithmic rewrites that would change public interfaces
  - dependency upgrades
  - schema or data-structure changes
  - parallelism / concurrency changes

  Decide which are worth pursuing manually.
4.4 Branch lifecycle
If the run produced no net improvement, offer to delete the branch cleanly:
git checkout <base-branch>
git branch -D autoresearch/${RUN_ID}
If improvements were made, leave the branch for PR review.
Domain metric quick-reference
Use these when the user isn't sure how to phrase their metric command.
Python runtime
# hyperfine (install: brew install hyperfine / pip install hyperfine)
hyperfine --warmup 3 'python my_script.py'
# → parse: "mean" field (lower is better)
# built-in timing
python -m timeit -n 100 -r 5 "import my_module; my_module.run()"
# cProfile summary
python -m cProfile -s cumtime my_script.py 2>&1 | head -20
Python memory
# tracemalloc (add to script, or use wrapper)
python -c "
import tracemalloc, my_module
tracemalloc.start()
my_module.run()
current, peak = tracemalloc.get_traced_memory()
print(f'peak_kb={peak/1024:.1f}')
"
# → parse: peak_kb= line (lower is better)
# memory_profiler (pip install memory_profiler)
python -m memory_profiler my_script.py
pytest
# pass rate
pytest --tb=no -q 2>&1 | tail -1
# → parse: "X passed" (higher is better)
# coverage
pytest --cov=src --cov-report=term-missing --tb=no -q 2>&1 | grep "TOTAL"
# → parse: last percentage (higher is better)
# duration
pytest --tb=no -q 2>&1 | grep "passed"
# → parse duration from summary line (lower is better)
ty
# count type errors (lower is better)
uv run ty check 2>/dev/null | grep -c "^error\["
# exit code only: 0 = clean, 1 = errors
uv run ty check 2>/dev/null; echo "exit=$?"
SQL (PostgreSQL)
# query duration via psql (wrap your query in EXPLAIN ANALYZE; -tA prints only the raw JSON)
psql $DATABASE_URL -tA -c "EXPLAIN (ANALYZE, FORMAT JSON) <your query>" \
| python -c "import sys,json; d=json.load(sys.stdin); print(d[0]['Execution Time'])"
# → parse: Execution Time (lower is better)
# if using pgbench
pgbench -c 5 -T 30 $DATABASE_URL 2>&1 | grep "tps ="
# → parse: tps value (higher is better)
Machine learning
# training run — capture final metric from stdout
python train.py --epochs 10 2>&1 | grep "val_loss" | tail -1
# → parse: val_loss= value (lower is better)
# sklearn cross-validation
python -c "
from sklearn.model_selection import cross_val_score
import numpy as np, my_model, my_data
X, y = my_data.load()
scores = cross_val_score(my_model.build(), X, y, cv=5, scoring='f1_macro')
print(f'f1={np.mean(scores):.4f}')
"
# → parse: f1= (higher is better)
Spark / Databricks (via Spark Connect)
Run PySpark scripts locally against a remote Databricks cluster using
databricks-connect. The cluster does the heavy lifting; the local process is
your experiment coordinator. See the spark-connect skill for setup details.
# Time a PySpark script end-to-end (local process + cluster execution)
hyperfine --warmup 1 'uv run python notebooks/my_pipeline.py'
# → parse: "mean" field (lower is better)
# Capture a row count or computed metric from Spark SQL
uv run python -c "
from utils.databricks import get_spark_session
spark = get_spark_session()
result = spark.sql('''
SELECT COUNT(*) as n FROM my_catalog.my_schema.my_table
WHERE condition = true
''').collect()[0]['n']
print(f'metric={result}')
"
# → parse: metric= (direction depends on goal)
# Capture MLflow metric from a training run
uv run python -c "
import mlflow
client = mlflow.tracking.MlflowClient()
run = client.search_runs(['<experiment_id>'], order_by=['metrics.val_f1 DESC'], max_results=1)[0]
print(f'val_f1={run.data.metrics[\"val_f1\"]:.4f}')
"
# → parse: val_f1= (higher is better)
Autoresearch loop model with Spark Connect:
Local machine (autoresearch loop)
└─ edits notebook / script
└─ commits the change
└─ runs: uv run python notebooks/my_script.py
└─ get_spark_session() → Spark Connect → Databricks Cluster
└─ reads metric from stdout / MLflow
└─ keep or revert based on metric
└─ next iteration
Constraints to consider for Spark runs:
- Each run spins up or warms up the Databricks cluster — cold starts add ~1–2 min
- Use `hyperfine --warmup 1` (not 3+) to avoid excessive cluster cost
- Set a generous time budget cap: 2× baseline is fine for cluster-backed runs
- Prefer `spark.sql(...)` over the PySpark DataFrame API for readability and planability
- If `DATABRICKS_CLUSTER_ID` or `DATABRICKS_PROFILE` is not set, fail fast with a clear message rather than hanging (see the check below)
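A minimal pre-flight check for that last point (a sketch; run it before the metric command):

```bash
# Fail fast if neither Databricks connection variable is configured
if [ -z "${DATABRICKS_CLUSTER_ID:-}" ] && [ -z "${DATABRICKS_PROFILE:-}" ]; then
  echo "error: set DATABRICKS_CLUSTER_ID or DATABRICKS_PROFILE before running autoresearch" >&2
  exit 1
fi
```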
Results TSV format
Filename: autoresearch-<run-id>.tsv (e.g. autoresearch-pytest-passrate.tsv)
Tab-separated, 5 columns:
experiment commit metric status description
0 a1b2c3d 142.3 baseline unmodified code
1 b2c3d4e 138.1 keep replace list comprehension with generator
2 c3d4e5f 145.0 discard switch to numpy vectorization (slower on small data)
3 d4e5f6g 0.0 crash add numba jit (import error, unfixable)
4 e5f6g7h 131.4 keep cache repeated db lookups with lru_cache
5 f6g7h8i 128.9 inspector-reject inline dict — metric +2% but +40 LOC unmaintainable
Key principles
| Principle | Why it matters |
|---|---|
| Measure everything | An unmeasured change is a guess. Every experiment has a number. |
| Revert failures | The branch tells the true story — only improvements survive. |
| Stay autonomous | Stopping to ask breaks the loop. Think harder instead. |
| Simplicity costs | Every line added is future maintenance. Weigh it honestly. |
| Log everything | The TSV is the research journal. Future you will thank present you. |