AutoResearch
Autonomously run an indefinite ML research loop: modify training code, run a timed experiment, measure one metric, keep improvements, discard regressions — repeat until manually stopped. Inspired by Karpathy's autoresearch and extended with cross-platform support (CUDA / Apple Silicon MPS / CPU / Windows) and user-configurable experiment parameters.
Overview
AutoResearch turns an AI agent into a tireless overnight researcher. Given a training codebase and a metric to minimize (or maximize), the agent continuously proposes code modifications, runs fixed-budget training runs, logs results, and builds a growing record of what works — without requiring human involvement between iterations.
The core loop:
┌──────────────────────────────────────────────────────────────────┐
│ AUTONOMOUS RESEARCH LOOP │
│ │
│ [1] Read current [2] Propose [3] Apply change │
│ codebase → modification → to train.py │
│ ↑ ↓ │
│ [7] Keep / discard [5] Evaluate [4] Commit & │
│ & iterate ← metric ← train (T min) │
│ ↑ ↓ │
│ [6] Log to ────────────────────────────── │
│ results.tsv │
└──────────────────────────────────────────────────────────────────┘
Key design principles (from Karpathy's autoresearch):
- Only one file is ever modified (train.py or its equivalent); data prep and evaluation are read-only
- One metric rules everything: lower (or higher) is better, no multi-objective confusion
- Fixed time budget per experiment: ensures fair, comparable runs regardless of architecture size
- No human interruption: the agent runs indefinitely until manually stopped
- Results logged to results.tsv, a permanent, auditable record
Phase 0 — User Configuration (Ask Before Starting)
Before touching any code, collect all required parameters. Ask explicitly for any missing values.
Required inputs
| # | Parameter | Description | Default if omitted |
|---|---|---|---|
| 1 | Training script path | Path to the file the agent will modify (e.g., train.py) | train.py in cwd |
| 2 | Read-only script path | Path to the file the agent must never modify (e.g., prepare.py) | prepare.py in cwd |
| 3 | Time budget per experiment | How long each training run is allowed to run | Ask; do not assume |
| 4 | Metric to optimize | Metric name and direction (minimize val_bpb, maximize val_acc, etc.) | Ask; do not assume |
| 5 | Max iterations | How many experiment cycles to run (or unlimited for indefinite) | unlimited |
| 6 | Allowed modification scope | What can be changed: architecture / hyperparameters / optimizer / all | all |
| 7 | Branch tag | Short identifier for the research branch (e.g., mar8, exp-attention) | Date-based (e.g., mar8) |
| 8 | Results file | Where to log results | results.tsv in cwd |
User-decided parameters (do not hardcode these)
The original autoresearch fixed the training window at 5 minutes. This skill lets the user choose:
"How long should each training run be? (e.g., 3 minutes, 10 minutes, 30 minutes)"
Likewise, the user decides:
- What metric to track (val_bpb, val_loss, accuracy, F1, perplexity — anything the eval harness reports)
- Whether to minimize or maximize it
- How many iterations to attempt before stopping (or run indefinitely)
- Which parts of the code are in scope for modification
Output of Phase 0 — Research Config
Produce and confirm a Research Config before proceeding:
## AutoResearch Config
| Parameter | Value |
|------------------------|-------------------------------|
| Training script | train.py |
| Read-only script | prepare.py |
| Time budget / run | [USER SPECIFIED] |
| Metric | [METRIC NAME] ([min/max]) |
| Max iterations | [N or unlimited] |
| Modification scope | [architecture / hyper / all] |
| Branch | autoresearch/[TAG] |
| Results file | results.tsv |
| Platform | [detected — see Phase 1] |
Get explicit user confirmation before starting Phase 1.
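One way to hold the confirmed config in code is a small validated record. The following is a minimal sketch, not part of the original skill: the `ResearchConfig` class, its field names, and the `default_tag` helper are all hypothetical names chosen for illustration. The required fields deliberately have no defaults, mirroring the "ask, do not assume" rule above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def default_tag(today: date) -> str:
    """Build a date-based branch tag, e.g. date(2025, 3, 8) -> 'mar8'."""
    return f"{MONTHS[today.month - 1]}{today.day}"


@dataclass(frozen=True)
class ResearchConfig:
    """Hypothetical container for the Phase 0 parameters."""
    # Required fields: no defaults, so they must be supplied explicitly.
    time_budget_minutes: float
    metric_name: str
    direction: str  # "min" or "max"
    branch_tag: str
    # Optional fields, using the defaults from the table above.
    training_script: str = "train.py"
    read_only_script: str = "prepare.py"
    max_iterations: Optional[int] = None  # None means unlimited
    scope: str = "all"
    results_file: str = "results.tsv"

    def __post_init__(self) -> None:
        if self.direction not in ("min", "max"):
            raise ValueError("direction must be 'min' or 'max'")
        if self.time_budget_minutes <= 0:
            raise ValueError("time budget must be positive")
        if self.max_iterations is not None and self.max_iterations < 1:
            raise ValueError("max_iterations must be >= 1 or None")

    @property
    def branch(self) -> str:
        """Full branch name in the autoresearch/[TAG] form."""
        return f"autoresearch/{self.branch_tag}"
```

A config built as `ResearchConfig(5, "val_bpb", "min", default_tag(date.today()))` would then carry every value the later phases need.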
Phase 1 — Platform Detection & Environment Setup
Detect the compute backend and configure the training environment accordingly.
1.1 Platform Detection
Run this detection sequence before touching any training code:
# Detection logic (adapt to the actual runtime)
import torch

if torch.cuda.is_available():
    platform = "CUDA"
    device = "cuda"
    notes = "FlashAttention-3 may be available; use torch.compile if supported"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    # The hasattr guard keeps this safe on PyTorch builds without an MPS backend
    platform = "MPS (Apple Silicon)"
    device = "mps"
    notes = "Disable torch.compile; use SDPA instead of FlashAttention; lower batch size for Metal bounds; cast optimizer states explicitly"
else:
    platform = "CPU"
    device = "cpu"
    notes = "Use smaller batch sizes; expect slower iteration; torch.compile may help"
Windows note: On Windows, MPS is not available. CUDA detection is the same as Linux. CPU fallback applies for systems without a supported GPU.
1.2 Platform-Specific Adjustments
Apply these adjustments based on detected platform before the first run:
| Platform | Adjustments |
|---|---|
| CUDA | Enable torch.compile if PyTorch >= 2.0. FlashAttention available if installed. Standard batch sizes apply. |
| MPS (Apple Silicon) | Disable torch.compile (unsupported paths). Replace FlashAttention with PyTorch native SDPA + manual sliding window causal masking. Lower device batch size (reduce by 2–4×). Explicitly cast optimizer states to float32 if Metal errors appear. |
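| CPU | Keep torch.compile off by default; it is optional on CPU, so enable it only if it measurably helps within the time budget. Use smaller batch sizes. Use gradient accumulation to maintain the effective batch size. Expect 5–10× slower runs, so advise the user to use shorter time budgets. |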
Log the detected platform and applied adjustments in the research config.
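The batch-size adjustments can be combined with gradient accumulation so the effective batch stays roughly constant across platforms. The sketch below is illustrative only: `plan_batch` and the `REDUCTION` table are hypothetical names, and the factors are assumptions taken from the lower end of the ranges quoted in this document (2× for MPS, 4× for CPU).

```python
from typing import Tuple

# Assumed reduction factors: lower end of the ranges given in this document.
REDUCTION = {"CUDA": 1, "MPS": 2, "CPU": 4}


def plan_batch(effective_batch: int, platform: str) -> Tuple[int, int]:
    """Return (device_batch, accumulation_steps) for a target effective batch.

    The device batch is shrunk by the platform factor, and accumulation
    steps are rounded up so that device_batch * steps >= effective_batch.
    """
    factor = REDUCTION[platform]
    device_batch = max(1, effective_batch // factor)
    accum_steps = -(-effective_batch // device_batch)  # ceiling division
    return device_batch, accum_steps
```

For an effective batch of 64, this gives a device batch of 32 with 2 accumulation steps on MPS, and 16 with 4 steps on CPU.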
1.3 Branch & Baseline Setup
- Create the research branch from the main branch: git checkout -b autoresearch/<tag>
- Read all three key files for full context:
  - README.md: repository overview and conventions
  - [read-only script]: understand the data pipeline, tokenizer, and evaluation harness. Never modify.
  - [training script]: understand the current model, optimizer, and training loop. This is the only file to modify.
- Run the baseline training for the configured time budget. Record:
  - Baseline metric value
  - Peak memory usage (MB; logged in GB in the mem_gb column of results.tsv)
  - Model parameter count
- Initialize results.tsv with a header row and the baseline entry:
commit val_bpb mem_gb status description
baseline [BASELINE_VALUE] [MEM] keep Baseline — no modifications
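The initialization step might look like the following sketch. `init_results` is a hypothetical helper, and the number formatting (four decimals for the metric, one for memory) is an assumption based on the example rows in this document. The file is opened in exclusive-create mode so an existing log from an earlier session is never overwritten.

```python
import csv


def init_results(path: str, metric_name: str,
                 baseline_value: float, mem_gb: float) -> None:
    """Create results.tsv with a header row and the baseline entry.

    Mode "x" raises FileExistsError if the file already exists, which
    protects the record of a previous session.
    """
    with open(path, "x", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        # The second column is named after the configured metric.
        writer.writerow(["commit", metric_name, "mem_gb", "status", "description"])
        writer.writerow([
            "baseline",
            f"{baseline_value:.4f}",
            f"{mem_gb:.1f}",
            "keep",
            "Baseline (no modifications)",
        ])
```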
Phase 2 — Autonomous Experiment Loop
This loop runs indefinitely (or until the configured max iterations). Do NOT pause to ask the human between iterations. Work continuously.
2.1 Iteration Start
At the beginning of each iteration:
- Check current git status — confirm working tree is clean.
- Read the current train.py (or equivalent) in full.
- Review results.tsv to understand what has been tried and what worked.
- Identify the best-performing configuration so far (lowest/highest metric based on direction).
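Finding the current best from the log can be sketched as below. The helper names `load_results` and `best_kept` are hypothetical, and the code assumes the column layout used in this document (a `status` column plus a column named after the metric). Only rows marked keep are considered, so crash rows with no metric value are never parsed as numbers.

```python
import csv
import io
from typing import Dict, List, Optional


def load_results(text: str) -> List[Dict[str, str]]:
    """Parse the contents of results.tsv into a list of row dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


def best_kept(rows: List[Dict[str, str]], metric: str,
              direction: str) -> Optional[Dict[str, str]]:
    """Return the kept row with the best metric, or None if nothing was kept."""
    kept = [r for r in rows if r["status"] == "keep"]
    if not kept:
        return None
    pick = min if direction == "min" else max
    return pick(kept, key=lambda r: float(r[metric]))
```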
2.2 Hypothesis Generation
Propose one focused modification per iteration. Good candidates include:
Architecture changes:
- Layer count (depth)
- Embedding dimension and attention heads
- Sequence length / context window
- Attention pattern (full / sliding window / mixed)
- Activation functions (ReLU, ReLU², SiLU, GELU)
- MLP expansion ratio
- Value residual / per-layer embeddings
Optimizer changes:
- Learning rate and schedule (warmup steps, cooldown shape)
- Weight decay schedule
- Optimizer type (AdamW, Muon, Lion, SGD variants)
- Gradient clipping
- Per-parameter learning rate multipliers
Training dynamics:
- Batch size and gradient accumulation
- Mixed precision (fp16, bf16)
- Dropout / stochastic depth
- Data ordering or curriculum
Hypothesis discipline:
- One change at a time — never bundle multiple independent modifications
- If the last N iterations all failed, try a significantly different direction
- If a change improved the metric, explore nearby variations before moving to a different dimension
- Prefer changes that are simple and interpretable over complex ones with equal effect
2.3 Apply, Commit, and Run
- Apply the proposed change to the training script.
- Commit with a descriptive message (e.g., exp: try ReLU² activation instead of GELU).
- Run training for exactly the configured time budget. Redirect output to a log file.
- Extract the metric value from training output using grep or equivalent.
# Example — adapt to actual metric name and log format
uv run train.py 2>&1 | tee run.log
VAL_BPB=$(grep "val_bpb:" run.log | tail -1 | awk '{print $2}')
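The same run-and-extract step can be sketched in Python for environments where shell pipelines are awkward (for example, Windows). Everything here is an assumption for illustration: the helper names are hypothetical, the log format is taken to be `name: value` as in the grep example, and `hard_timeout_s` is a safety cap on top of whatever budget the training script enforces itself.

```python
import re
import subprocess
from typing import List, Optional


def extract_metric(log_text: str, name: str) -> Optional[float]:
    """Return the last 'name: <number>' value in the log, or None."""
    # The lookbehind stops 'loss' from matching inside 'val_loss'.
    pattern = re.compile(
        rf"(?<!\w){re.escape(name)}:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
    )
    matches = pattern.findall(log_text)
    return float(matches[-1]) if matches else None


def run_training(cmd: List[str], log_path: str,
                 hard_timeout_s: float) -> Optional[str]:
    """Run the training command and return its output, or None on failure.

    Failure means a non-zero exit code or exceeding the hard timeout.
    """
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True,
                              timeout=hard_timeout_s)
    except subprocess.TimeoutExpired:
        return None
    # Save the combined stdout/stderr for later inspection.
    with open(log_path, "w", encoding="utf-8") as f:
        f.write(proc.stdout)
    return proc.stdout if proc.returncode == 0 else None
```

A call such as `run_training(["uv", "run", "train.py"], "run.log", 600)` followed by `extract_metric(output, "val_bpb")` mirrors the shell example, with `None` at either step treated as a crash.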
2.4 Evaluate and Decide
Compare the extracted metric to the current best:
| Outcome | Action |
|---|---|
| Improved (metric better by any amount) | Keep the commit. Update current best. Mark keep in results. |
| No change or regression | Discard: git reset --hard HEAD~1. Mark discard in results. |
| Crash / error | Discard. Log the failure reason. Mark crash in results. Do not retry the same change. |
| Memory OOM | If on MPS: reduce device batch size by 2×, retry once. If still OOM: discard. If on CUDA: reduce batch or enable gradient checkpointing, retry once. |
Simplicity principle (from Karpathy's autoresearch): If two configurations achieve the same metric, prefer the one with fewer lines of code. A deletion that maintains performance is better than an addition.
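The decision table reduces to a small pure function. This is a sketch under stated assumptions: `decide` is a hypothetical name, a missing metric is represented as `None`, and a NaN metric is treated as a crash (an assumption; the table above does not mention NaN). The OOM retry path is left out because it depends on the platform.

```python
import math
from typing import Optional


def decide(new_value: Optional[float], best_value: float,
           direction: str) -> str:
    """Map a run's metric to 'keep', 'discard', or 'crash'."""
    if new_value is None or math.isnan(new_value):
        return "crash"
    if direction == "min":
        improved = new_value < best_value
    else:
        improved = new_value > best_value
    # A tie counts as "no change", so it is discarded.
    return "keep" if improved else "discard"
```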
2.5 Log Results
After every run (kept or discarded), append to results.tsv:
<7-char-commit> <metric-value> <mem_gb> <keep|discard|crash> <one-line description>
Example:
a3f12bc 0.9821 43.2 keep ReLU² activation: -0.0158 improvement
d9e44a1 1.0023 44.1 discard Deeper MLP: regression
c0011f2 — — crash OOM: doubled batch size without accumulation fix
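Appending a row can be sketched as follows. `format_row` and `append_row` are hypothetical helpers; the placeholder for missing values copies the one used in the example crash row, and the number formatting is an assumption based on the examples above. Whitespace in the description is collapsed so a stray tab or newline cannot break the column layout.

```python
from typing import Optional

MISSING = "\u2014"  # same placeholder as the example crash row above


def format_row(commit: str, value: Optional[float], mem_gb: Optional[float],
               status: str, description: str) -> str:
    """Build one tab-separated results line (without a trailing newline)."""
    if status not in ("keep", "discard", "crash"):
        raise ValueError(f"unknown status: {status}")
    metric = MISSING if value is None else f"{value:.4f}"
    mem = MISSING if mem_gb is None else f"{mem_gb:.1f}"
    clean = " ".join(description.split())  # collapse tabs and newlines
    return "\t".join([commit[:7], metric, mem, status, clean])


def append_row(path: str, row: str) -> None:
    """Append a formatted row to the results file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(row + "\n")
```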
2.6 Progress Reporting
Every 10 iterations (or when manually queried), emit a progress report:
## AutoResearch Progress — Iteration [N]
**Best so far:** [METRIC_VALUE] (iteration [K], commit [HASH])
**vs. Baseline:** [DELTA] ([+/-]%)
**Iterations:** [N completed] / [max or unlimited]
**Time elapsed:** [HH:MM]
**Recent results (last 5):**
| Iter | Commit | Metric | Status | Change |
|------|--------|--------|--------|--------|
| N | abc1234| 0.9821 | keep | ReLU² activation |
| N-1 | def5678| 1.0023 | discard| Deeper MLP |
| ... | | | | |
**Current direction:** [what the agent is exploring next]
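The "vs. Baseline" line and the reporting cadence involve only simple arithmetic, sketched here with hypothetical helper names. The percentage is computed relative to the baseline value, which is an assumption about how the template's percent field is meant to be filled.

```python
from typing import Tuple


def delta_vs_baseline(baseline: float, best: float) -> Tuple[float, float]:
    """Return (absolute delta, percent change relative to baseline)."""
    if baseline == 0:
        raise ValueError("baseline of 0 has no defined percent change")
    delta = best - baseline
    return delta, 100.0 * delta / baseline


def report_due(iteration: int, every: int = 10) -> bool:
    """True on iterations 10, 20, 30, ... for the default cadence."""
    return iteration > 0 and iteration % every == 0
```

For a minimized metric, a negative delta is an improvement; for a maximized one, a positive delta is.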
Phase 3 — Stopping and Reporting
The loop stops when:
- The user manually stops the agent
- The configured max iterations is reached
- The user's configured stop condition is met (if provided)
3.1 Final Summary
When stopped, produce a Research Summary document (autoresearch-summary.md):
# AutoResearch Summary
**Research session:** autoresearch/<tag>
**Platform:** [CUDA / MPS / CPU]
**Date:** [date range]
**Total iterations:** [N]
**Time budget / run:** [T minutes]
## Results
| Baseline metric | Best metric | Improvement | Best commit |
|-----------------|-------------|-------------|-------------|
| [BASELINE] | [BEST] | [DELTA] | [HASH] |
## Top Improvements (kept commits)
| Rank | Commit | Metric | Description |
|------|--------|--------|-------------|
| 1 | abc123 | 0.9612 | [description] |
| 2 | def456 | 0.9744 | [description] |
| ... | | | |
## What Didn't Work
| Category | Description | Outcome |
|----------|-------------|---------|
| [category] | [description] | [why it failed] |
## Recommended Next Steps
- [3–5 actionable recommendations based on the session findings]
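The "Top Improvements" table in the template could be generated from the parsed results rows, as in this sketch. `top_improvements` is a hypothetical helper; it assumes rows are dicts keyed by the results.tsv column names and that the baseline entry uses the literal commit value `baseline`, as in the initialization example.

```python
from typing import Dict, List


def top_improvements(rows: List[Dict[str, str]], metric: str,
                     direction: str, limit: int = 5) -> str:
    """Render kept commits, best first, as a markdown table."""
    kept = [r for r in rows
            if r["status"] == "keep" and r["commit"] != "baseline"]
    # Ascending for a minimized metric, descending for a maximized one.
    kept.sort(key=lambda r: float(r[metric]), reverse=(direction == "max"))
    lines = [
        "| Rank | Commit | Metric | Description |",
        "|------|--------|--------|-------------|",
    ]
    for rank, row in enumerate(kept[:limit], start=1):
        lines.append(
            f"| {rank} | {row['commit']} | {row[metric]} | {row['description']} |"
        )
    return "\n".join(lines)
```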
3.2 Merge Recommendation
After summarizing:
- Show the user the list of kept commits and the final metric value.
- Ask whether to merge the best-performing branch into main, or keep as-is.
- If merging: squash or rebase per the user's preference. Never force-push main.
Cross-Platform Compatibility Matrix
| Feature | CUDA (Linux/Windows) | MPS (macOS) | CPU (any OS) |
|---|---|---|---|
| torch.compile | ✅ Enabled | ❌ Disabled | ✅ Optional |
| FlashAttention | ✅ If installed | ❌ Use SDPA | ❌ Use SDPA |
| Sliding window attn | ✅ via FA or SDPA | ✅ SDPA + manual mask | ✅ SDPA + manual mask |
| Mixed precision (bf16) | ✅ | ✅ (M2+) | ❌ Use fp32 |
| torch.compile modes | reduce-overhead | Disabled | reduce-overhead |
| Batch size guidance | Full config | Reduce 2–4× | Reduce 4–8× |
| Optimizer state casting | Not needed | fp32 explicit cast | Not needed |
Allowed Modifications Reference
The agent may only modify the training script (train.py or equivalent). It must never modify:
- The data preparation script (prepare.py or equivalent)
- The evaluation harness
- The tokenizer or vocabulary files
- Package dependency files
The agent must never install new packages mid-session (all dependencies must already be available).
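A guard for the single-file rule can be sketched as a check over the list of changed paths before each commit is accepted. `unexpected_changes` is a hypothetical helper; the path list is assumed to come from the output of `git diff --name-only HEAD~1 HEAD`, split into lines.

```python
from pathlib import PurePosixPath
from typing import List


def unexpected_changes(changed_files: List[str],
                       training_script: str = "train.py") -> List[str]:
    """Return every changed path that is not the training script.

    Git reports paths with forward slashes, so PurePosixPath is used
    for comparison; it also normalizes a leading './'.
    """
    allowed = PurePosixPath(training_script)
    return [
        f.strip() for f in changed_files
        if f.strip() and PurePosixPath(f.strip()) != allowed
    ]
```

An empty result means the commit touched only the training script; anything else is a reason to discard the change.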
Integration with Research Team
AutoResearch is designed to plug directly into the research pipeline as a specialized experiment execution layer:
- Receives from lead-researcher: A hypothesis about a training approach to explore (e.g., "try hierarchical attention with reduced context window").
- Receives from experiment-design: A structured experiment plan specifying what to vary, what to hold fixed, and what metric to optimize.
- Delivers to principal-scientist: A results.tsv and autoresearch-summary.md with the best configuration found and all supporting evidence.
- Delivers to research-writing: A structured table of experiments suitable for the methodology and results sections of a paper.
When operating within the research team, the research config (Phase 0) is pre-populated by the lead-researcher or experiment-design teammates rather than entered manually by the user.
Quick-Start Paths
| User intent | Entry point |
|---|---|
| "Run autoresearch overnight on my training script" | Phase 0 (full config) → Phase 1 → Phase 2 loop |
| "I have a hypothesis — test it and keep going" | Phase 0 (pass hypothesis as first iteration direction) → loop |
| "Run this for exactly 20 iterations and report" | Phase 0 (max_iterations=20) → loop → Phase 3 |
| "Continue a previous session" | Phase 1.3 (skip branch creation, re-read existing results.tsv) → Phase 2 |
| "Analyze what happened in last night's session" | Phase 3 only — summarize existing results.tsv |
Output Summary
| Phase | Artifact |
|---|---|
| 0 | Research Config (confirmed by user) |
| 1 | Platform detection log, baseline entry in results.tsv |
| 2 | Growing results.tsv + per-iteration progress reports |
| 3 | autoresearch-summary.md with ranked improvements and recommendations |