ML Experiment Loop

When to Use

  • Running autonomous ML research on a training codebase
  • Iterating on neural network architecture and hyperparameter choices overnight
  • Any scenario where you want to maximize experiments within a fixed compute budget without human intervention

Phase 1: Setup (One-Time, Before the Loop)

Complete this phase once before starting the experiment loop.

Step 1.1 — Agree on Run Tag

Propose a run tag based on today's date (e.g., mar14). The branch autoresearch/<tag> must NOT already exist — this is a fresh run.

git branch --list "autoresearch/*"

Step 1.2 — Create the Branch

git checkout -b autoresearch/<tag>
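Steps 1.1 and 1.2 can be combined into one check-then-create sketch. This is illustrative only: the throwaway repo and the tag mar14 stand in for the real training repo and today's tag.

```shell
# Sketch: refuse to reuse a tag whose branch already exists.
# A throwaway repo stands in for the real training repo here.
REPO=$(mktemp -d)
cd "$REPO"
git init -q
TAG=mar14   # example tag; use today's date in a real run
if git rev-parse --verify --quiet "refs/heads/autoresearch/${TAG}" >/dev/null; then
  BRANCH_STATE=exists          # stale tag: pick a fresh one instead
else
  git checkout -q -b "autoresearch/${TAG}"
  BRANCH_STATE=created
fi
```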

Step 1.3 — Read In-Scope Files

Read these three files for full context before touching anything:

  • README.md — repository context and goals
  • prepare.py — fixed constants, data prep, tokenizer, dataloader, evaluation. DO NOT MODIFY.
  • train.py — the only file you modify. Architecture, optimizer, hyperparameters, training loop.

Step 1.4 — Verify Data Exists

ls ~/.cache/autoresearch/

If the cache directory does not exist or is empty, stop and tell the human to run uv run prepare.py first.
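The "exists and is non-empty" check can be made explicit. A minimal sketch, using the cache path from the step above:

```shell
# Sketch: report whether a prepared-data cache directory is usable.
data_ready() {
  if [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; then
    echo yes
  else
    echo no    # stop and ask the human to run: uv run prepare.py
  fi
}
data_ready "$HOME/.cache/autoresearch"
```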

Step 1.5 — Environment Sanity

uv sync

Step 1.6 — Initialize results.tsv

Create results.tsv with just the header row. This file stays untracked by git throughout the run.

echo -e "commit\tval_bpb\tmemory_gb\tstatus\tdescription" > results.tsv

Step 1.7 — Establish Baseline

Your very first run MUST be the unmodified baseline. Do not edit train.py yet. Run the experiment as-is (see Phase 2) to establish the baseline metric. Record it in results.tsv.


Phase 2: The Experiment Loop (LOOP FOREVER)

This loop runs indefinitely until the human manually interrupts it. NEVER ask the human if you should continue. NEVER stop for any reason other than: the human interrupts, or a run crashes beyond repair after multiple fix attempts.

WHILE TRUE:
  1. Look at git state (current branch/commit)
  2. Formulate an experimental hypothesis
  3. Edit train.py
  4. git commit
  5. Run the experiment (redirect ALL output to file)
  6. Extract the metric via grep
  7. Evaluate: crash? improve? equal? worse?
  8. Log to results.tsv
  9. Keep (advance branch) or discard (git reset)
  10. Repeat from step 2

Step 2.1 — Check Git State

git log --oneline -5
git status

Step 2.2 — Formulate a Hypothesis

Pick ONE focused idea to test. Examples:

  • "Increase learning rate from 0.01 to 0.03"
  • "Add gradient clipping at norm 1.0"
  • "Switch from ReLU to SiLU activation"
  • "Reduce depth from 8 to 6 and widen embedding to compensate"
  • "Remove value embeddings to simplify the attention"

If you have run out of obvious ideas:

  • Re-read train.py from scratch for angles you missed
  • Re-read prepare.py for constraints you may not have noticed
  • Try combining two near-miss experiments from results.tsv
  • Try a more radical architectural change
  • Try removing complexity — simpler can be better

You will not ask the human for ideas. You generate ideas yourself.

Step 2.3 — Edit train.py

Apply only the changes needed for this single hypothesis. Keep the diff minimal and reviewable.

Constraints (from prepare.py — cannot change):

  • Training time budget: 5 minutes wall clock (excluding startup/compilation)
  • Sequence length, evaluation protocol, tokenizer
  • evaluate_bpb function — this is the ground truth metric

What you CAN change in train.py:

  • Model architecture (depth, width, attention pattern, activations)
  • Optimizer (type, learning rate, scheduler, momentum)
  • Training loop (batch size, accumulation, warmup)
  • Anything else in train.py

VRAM constraint: Large VRAM increases are acceptable only for meaningful metric gains.

Step 2.4 — Git Commit

git add train.py
git commit -m "experiment: <one-line description of what you changed>"

Step 2.5 — Run the Experiment (CONTEXT-SAFE)

Redirect ALL output to a log file. NEVER let training output stream directly into your context. Streaming training logs will flood your context window and crash the session.

uv run train.py > run.log 2>&1

This will run for approximately 5 minutes. If it has not finished after 10 minutes, kill it:

kill %1   # or kill the process by PID

A 10-minute timeout is treated as a crash — discard and revert.
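Killing by hand works, but the ten-minute ceiling can also be enforced automatically. A sketch using coreutils timeout (assumed available), with sleep 2 standing in for a hung training run:

```shell
# Sketch: timeout kills the command and exits 124 when it overruns the limit.
# Real usage would be: timeout 600 uv run train.py > run.log 2>&1
STATUS=0
timeout 1 sleep 2 || STATUS=$?   # sleep 2 stands in for a hung training run
if [ "$STATUS" -eq 124 ]; then
  OUTCOME=crash                  # overran the budget: log as crash, then git reset
else
  OUTCOME="exited $STATUS"
fi
```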

Step 2.6 — Extract the Metric (TARGETED GREP ONLY)

DO NOT cat run.log. DO NOT tail -n 500 run.log.

Extract only the key metrics:

grep "^val_bpb:\|^peak_vram_mb:" run.log

Expected output when successful:

val_bpb:          0.997900
peak_vram_mb:     45060.2
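To get the two numbers into shell variables for logging, a small awk sketch; the heredoc reproduces the expected output above and stands in for a real run.log:

```shell
# Sketch: parse the two metric lines into variables.
# The heredoc stands in for a run.log written by a real training run.
cat > run.log <<'EOF'
val_bpb:          0.997900
peak_vram_mb:     45060.2
EOF
VAL_BPB=$(awk '/^val_bpb:/ {print $2}' run.log)
PEAK_MB=$(awk '/^peak_vram_mb:/ {print $2}' run.log)
```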

Step 2.7 — Evaluate the Result

Case A: Crash (grep returned nothing or training errored)

tail -n 50 run.log

Read the Python stack trace. Decide:

  • Trivial fix (typo, missing import, dimension arithmetic error): Fix it, re-commit, re-run once.
  • Fundamentally broken idea (OOM with huge model, logically impossible change): Do not keep trying. Log as crash and revert.

If you cannot fix a crash after 2 attempts, give up on the idea.
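Detecting Case A in the first place is cheap: an empty grep means no metric was printed. A sketch, with a fabricated traceback standing in for a real crashed run.log:

```shell
# Sketch: classify the run before reading any log tail.
# The fake traceback below stands in for a real crashed run.log.
printf 'Traceback (most recent call last):\n  ValueError: shape mismatch\n' > run.log
if grep -q '^val_bpb:' run.log; then
  CASE=success
else
  CASE=crash    # only now is tail -n 50 run.log worth reading
fi
```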

Case B: Success (val_bpb improved — lower than current baseline)

Keep the commit. The branch now "advances" — this commit becomes the new baseline.

Update your internal baseline value.

Simplicity criterion: Before keeping a win, weigh it:

  • Improvement of ~0.001 val_bpb + added 20 lines of complex code → probably not worth it
  • Improvement of ~0.001 val_bpb from deleting code → definitely keep
  • Improvement of ~0 but much simpler code → keep (simplification win)
  • Large improvement (>0.005 val_bpb) + reasonable complexity → keep

Case C: No improvement (val_bpb equal or worse)

Discard immediately. Do NOT try to "fix" a bad idea.

git reset --hard HEAD~1

This reverts train.py to the previous baseline commit.

Step 2.8 — Log to results.tsv

Record the experiment. Use TAB separators (not commas — commas break descriptions).

COMMIT=$(git rev-parse --short HEAD)
# Fill in values from the grep output and your decision
echo -e "${COMMIT}\t0.997900\t44.0\tkeep\tincrease LR to 0.04" >> results.tsv
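The memory_gb value in that echo comes from the peak_vram_mb line grepped in Step 2.6. A sketch of the conversion (divide by 1024, round to one decimal):

```shell
# Sketch: convert peak_vram_mb (from Step 2.6) to the memory_gb column.
PEAK_MB=45060.2   # example value from the grep output
MEM_GB=$(awk -v mb="$PEAK_MB" 'BEGIN { printf "%.1f", mb / 1024 }')
```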

TSV schema:

  • commit (string): 7-char short hash, e.g. a1b2c3d
  • val_bpb (float): e.g. 0.997900; use 0.000000 for crashes
  • memory_gb (float): peak_vram_mb / 1024, rounded to 1 decimal, e.g. 44.0; use 0.0 for crashes
  • status (enum): keep, discard, or crash
  • description (string): short text, no tabs, e.g. increase LR to 0.04

Example results.tsv:

commit val_bpb memory_gb status description
a1b2c3d 0.997900 44.0 keep baseline
b2c3d4e 0.993200 44.2 keep increase LR to 0.04
c3d4e5f 1.005000 44.0 discard switch to GeLU activation
d4e5f6g 0.000000 0.0 crash double model width (OOM)

IMPORTANT: Do NOT git add results.tsv. Leave it untracked. It tracks all experiments across keeps and discards on this branch.
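Because the file is tab-separated, the current best commit can be recovered in one awk pass. An illustrative sketch; the printf lines rebuild the example table above, which a real run would already have on disk:

```shell
# Sketch: find the best (lowest val_bpb) non-crash commit in results.tsv.
# The printf lines reproduce the example table above for illustration.
printf 'commit\tval_bpb\tmemory_gb\tstatus\tdescription\n'             >  results.tsv
printf 'a1b2c3d\t0.997900\t44.0\tkeep\tbaseline\n'                     >> results.tsv
printf 'b2c3d4e\t0.993200\t44.2\tkeep\tincrease LR to 0.04\n'          >> results.tsv
printf 'c3d4e5f\t1.005000\t44.0\tdiscard\tswitch to GeLU activation\n' >> results.tsv
printf 'd4e5f6g\t0.000000\t0.0\tcrash\tdouble model width (OOM)\n'     >> results.tsv
BEST=$(awk -F'\t' 'NR > 1 && $4 != "crash" && (!seen || $2 < min) \
  { min = $2; best = $1; seen = 1 } END { print best }' results.tsv)
```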


Simplicity Criterion (Decision Framework)

When evaluating whether to keep a change, apply this framework:

  • > 0.005 val_bpb lower, reasonable complexity → keep
  • 0.001–0.005 lower, minimal complexity → keep
  • 0.001–0.005 lower, major complexity (20+ lines, hacky) → discard
  • ≈ 0, simpler code (fewer lines) → keep (simplification win)
  • ≈ 0, equal complexity → discard
  • 0 or worse, any complexity → discard

Goal: the lowest val_bpb in the cleanest code. Complexity is a debt that compounds.


Idea Generation (When Stuck)

If you've exhausted your idea backlog, work through these categories:

  1. Learning rate and schedule — try warmup, cosine decay, different peak LR
  2. Architecture depth vs. width — trade depth for width, or vice versa
  3. Attention patterns — local/global windowed attention, number of KV heads
  4. Optimizer — Muon vs. AdamW vs. hybrid, momentum coefficients
  5. Normalization — RMSNorm placement, pre/post norm
  6. Activation functions — SiLU, GeLU, ReGLU
  7. Batch size and gradient accumulation — total batch size vs. micro-batch size
  8. Simplification — remove features that might be hurting (e.g., value embeddings)
  9. Combination — combine two previously near-miss experiments
  10. Radical changes — double depth, halve width, completely different architecture style

Re-reading train.py and prepare.py from scratch often surfaces new angles.


Iron Laws

  1. NEVER STOP the loop to ask the human for permission — the human is likely asleep. Run indefinitely until interrupted.
  2. ALWAYS redirect training output to a file: uv run train.py > run.log 2>&1. Streaming output floods context and crashes the session.
  3. NEVER cat or tail -n 500 the run log — only use targeted grep "^val_bpb:\|^peak_vram_mb:" run.log for metrics.
  4. NEVER modify prepare.py — it is read-only. The evaluation protocol is fixed.
  5. ALWAYS git reset on discard — revert immediately with git reset --hard HEAD~1. Never try to iterate on a failed idea.
  6. ALWAYS keep results.tsv untracked — never git add results.tsv. It records all experiments across the branch.
  7. NEVER install new packages — only use what's in pyproject.toml.

Anti-Patterns

  • cat run.log or tail -n 500 run.log: floods context with gigabytes of training logs and crashes the session. Instead: grep "^val_bpb:\|^peak_vram_mb:" run.log only.
  • Asking "should I keep going?": the human is likely asleep; this defeats the purpose of autonomous research. Instead: NEVER STOP; continue until manually interrupted.
  • Trying to "fix" a failed hypothesis: most bad ideas are fundamentally wrong, not implementation bugs. Instead: git reset --hard HEAD~1 and move to the next idea.
  • Running multiple hypotheses in one experiment: impossible to attribute wins or losses to specific changes. Instead: one hypothesis per commit, one commit per experiment.
  • Modifying prepare.py: corrupts the evaluation protocol; results become incomparable. Instead: never touch prepare.py; it is fixed.
  • Forgetting to redirect output: training stdout floods agent context mid-experiment. Instead: always uv run train.py > run.log 2>&1.
  • git add results.tsv: clutters git history; results span all experiments, including discards. Instead: never track results.tsv in git.
  • Using commas in results.tsv: commas inside the description field break parsers. Instead: always use TAB separators.
  • Waiting 10+ minutes for a stuck run: OOMs and infinite loops hang silently. Instead: kill any run exceeding 10 minutes and treat it as a crash.
  • Keeping a tiny win with major added complexity: complexity accumulates and future experiments suffer. Instead: apply the simplicity criterion; tiny gain + ugly code = discard.

Memory Protocol (MANDATORY)

Before starting:

node .claude/lib/memory/memory-search.cjs "ml experiment loop autonomous training"

Read .claude/context/memory/learnings.md

After completing a session:

  • Winning patterns discovered → .claude/context/memory/learnings.md
  • Crash causes and workarounds → .claude/context/memory/issues.md
  • Architectural decisions → .claude/context/memory/decisions.md

ASSUME INTERRUPTION: If it's not in memory or results.tsv, it didn't happen.

Related Skills

  • ai-ml-expert — Deep PyTorch and ML domain knowledge for hypothesis generation
  • modern-python — uv/ruff/ty tooling for Python project management
  • git-expert — Advanced git operations for branch and reset workflows