autoresearch
Autoresearch
Set up and run Andrej Karpathy's autoresearch — an autonomous AI research loop where an AI agent iterates on a tiny language model's training code overnight, running ~100 experiments while you sleep.
Concept
Human writes program.md (the "research org" instructions)
↓
AI agent reads program.md
↓
Agent modifies train.py (model, optimizer, hyperparameters)
↓
Runs 5-minute training on GPU → measures val_bpb
↓
If improved → git commit (keep). If not → git revert (discard).
↓
Repeat (~12 experiments/hour, ~100 overnight)
Key insight: You don't touch any Python files. Instead, you program program.md —
the Markdown instructions that guide the AI agent. You're programming the research
organization, not running individual experiments.
Prerequisites
Hardware
| Platform | Requirement |
|---|---|
| Mac (recommended for this skill) | Apple Silicon (M1/M2/M3/M4), 16 GB RAM minimum (32 GB+ better) |
| Linux/Windows | NVIDIA GPU (RTX 3060+), CUDA toolkit installed |
Check your Mac chip: Apple menu → About This Mac → look for "Chip: M1/M2/M3/M4". Any Mac bought since late 2020 has Apple Silicon.
Software
| Tool | Purpose | Check |
|---|---|---|
| Git | Experiment tracking (save points) | git --version |
| uv | Python + dependency manager | uv --version |
| Claude Code (or Cursor/Codex) | The AI agent brain | claude --version |
Setup
Step 1: Install prerequisites
# Install uv (handles Python + all dependencies automatically)
curl -LsSf https://astral.sh/uv/install.sh | sh
# IMPORTANT: Close and reopen your terminal after installing uv
# Verify
uv --version
git --version
Step 2: Clone the repo
Mac (Apple Silicon):
cd ~/Desktop
git clone https://github.com/miolini/autoresearch-macos.git
cd autoresearch-macos
Linux/Windows (NVIDIA GPU):
cd ~/Desktop
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
About the Mac fork: Karpathy himself links to
miolini/autoresearch-macosfrom his README. The developer (Artem Andreenko) has 167 public projects on GitHub and a years-long track record. The fork swaps FlashAttention-3 for PyTorch's built-in SDPA and adds Apple Metal/MPS optimizations. The entire codebase is ~630 lines — fully auditable in 20 minutes.
Step 3: Install dependencies and prepare data
# Install Python + all packages
uv sync
# Download training data + build tokenizer (one-time, ~2 min)
uv run prepare.py
# Run one test training to verify setup (~5 min)
uv run train.py
If the test training finishes and shows a val_bpb score — you're ready.
Step 4: Launch the autonomous loop
# Navigate to the project
cd ~/Desktop/autoresearch-macos # or autoresearch on Linux
# Launch Claude Code
claude
Then type this prompt:
Hi have a look at program.md and let's kick off a new experiment! Let's do the setup first.
That's it. Minimize the window and go to sleep. The agent will:
- Read
program.md - Modify
train.pywith an experimental change - Run a 5-minute training
- Check
val_bpb— if improved, git commit; if not, git revert - Repeat all night
Pro tip: To make it fully autonomous, tell the agent upfront: "Run fully autonomously. Don't ask for confirmation between experiments. Keep going until I come back."
Project Structure
autoresearch/
├── prepare.py ← Constants, data prep, runtime utilities (DO NOT modify)
├── train.py ← Model + optimizer + training loop (agent modifies this)
├── program.md ← Agent instructions (human modifies this)
├── pyproject.toml ← Dependencies
├── results.tsv ← Experiment log (score, memory, kept/discarded)
└── analysis.ipynb ← Graphs showing progress over time
The three files that matter
| File | Modified by | Purpose |
|---|---|---|
prepare.py |
Nobody | Fixed constants, one-time data prep, runtime utilities |
train.py |
AI agent | GPT model, Muon + AdamW optimizer, training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. |
program.md |
Human | Instructions for the AI agent. This is your leverage point — better instructions → faster research progress. |
Key Terminology
| Term | Meaning |
|---|---|
| val_bpb | Validation bits per byte — the score measuring model quality. Lower = better. Vocab-size-independent so architectural changes are fairly compared. |
| train.py | The single Python file containing all training code. The AI agent modifies only this file during experiments. |
| program.md | Your instruction file for the AI agent. The only file you (the human) should edit. Think of it as a mission briefing for your tireless lab assistant. |
| 5-minute budget | Every experiment gets exactly 5 minutes of training time. Makes experiments directly comparable regardless of what the agent changes. ~12 experiments/hour. |
Design Choices
- Single file to modify. The agent only touches
train.py. Keeps scope manageable and diffs reviewable. - Fixed time budget. Training always runs for exactly 5 minutes, regardless of platform. This makes experiments directly comparable and means autoresearch finds the most optimal model for your specific hardware.
- Self-contained. No external dependencies beyond PyTorch. No distributed training, no complex configs. One GPU, one file, one metric.
Tips for Best Results
-
Start simple. Get one manual
uv run train.pyworking first. If that doesn't work, the autonomous loop won't either. -
Your one job is to improve
program.md. Add instructions like:- "Try small improvements first"
- "Focus on making val_bpb go down"
- "Think step by step and explain every change before making it"
- "If an experiment direction hasn't worked after 3 attempts, try something completely different"
-
Don't panic when experiments fail. Most will not improve the score. Out of 100 overnight experiments, maybe 10–20 are keepers. This is normal — the agent automatically keeps wins and discards losses.
-
Check in periodically at first. Watch the first 3–4 experiments to make sure the loop is working before going AFK.
-
More memory helps. 32 GB+ unified memory on Mac lets the agent explore larger models and more complex architectures.
Tuning for Smaller Hardware
If running on a Mac with limited memory, Karpathy recommends these adjustments (ask the agent to make them, or edit train.py/prepare.py yourself):
| Setting | Location | Default | Smaller Hardware |
|---|---|---|---|
| Dataset | prepare.py |
FineWeb-Edu | Use TinyStories for better results at small scale |
vocab_size |
prepare.py |
8192 | Try 4096, 2048, 1024, or even 256 (byte-level) |
MAX_SEQ_LEN |
prepare.py |
Large | Lower significantly, even down to 256 |
DEVICE_BATCH_SIZE |
train.py |
Default | Increase slightly as you lower MAX_SEQ_LEN |
EVAL_TOKENS |
prepare.py |
Default | Decrease so validation runs faster |
DEPTH |
train.py |
8 | Lower to 4 for smaller models |
WINDOW_PATTERN |
train.py |
"SSSL" | Use just "L" (alternating banded attention may be inefficient) |
TOTAL_BATCH_SIZE |
train.py |
Default | Lower to 2**14 (~16K) or smaller |
Checking Results
After a night of experiments:
# See the git history of successful experiments
git log --oneline
# Check the results log
cat results.tsv
# Open the analysis notebook (optional)
# Use Jupyter or Cursor to view analysis.ipynb
You'll find:
- Git history — each commit is a successful experiment that improved val_bpb
- Lower val_bpb — the model is genuinely smarter (baseline starts around 0.9979)
- Modified train.py — architecture tweaks, optimizer changes, hyperparameter adjustments
- results.tsv — every experiment with score, memory usage, and keep/discard status
Alternative Agent Options
| Agent | Cost | Best For |
|---|---|---|
| Claude Code | $20/mo (Pro) or $100/mo (Max) | Full autopilot — runs entirely in Terminal |
| Cursor | Free tier available, $20/mo Pro | Visual learners — AI chat panel + file editor |
| Codex CLI | Varies | Alternative to Claude Code |
| Claude.ai chat | Free/$20/mo | Manual only — copy-paste results back and forth |
Troubleshooting
| Problem | Fix |
|---|---|
command not found: uv |
Close terminal, open a new one |
command not found: git |
Mac: install Xcode CLI tools. Linux: sudo apt install git |
| CUDA / GPU error (Linux/Windows) | Search "install CUDA toolkit [your GPU]" |
| MPS / Metal error (Mac) | Make sure you cloned miolini/autoresearch-macos, not the original |
| Out of memory | GPU needs more VRAM. Agent usually adapts automatically. See tuning table above. |
| Claude Code auth error | Requires paid Claude subscription ($20/mo minimum) |
| Test training works but loop doesn't start | Make sure you're in the right folder when launching claude. Be explicit in your prompt. |
References
| Resource | Link |
|---|---|
| Original repo (NVIDIA) | karpathy/autoresearch |
| Mac fork (Apple Silicon) | miolini/autoresearch-macos |
| MLX fork (Mac) | trevin-creator/autoresearch-mlx |
| Windows fork (RTX) | jsegov/autoresearch-win-rtx |
| Karpathy's announcement | Tweet |
| Karpathy's update | Tweet |
| TinyStories dataset | HuggingFace |
| uv package manager | astral.sh/uv |