ML Experiment Loop

When to Use

  • Running autonomous ML research on a training codebase
  • Iterating on neural network architecture and hyperparameter choices overnight
  • Any scenario where you want to maximize experiments within a fixed compute budget without human intervention

Phase 1: Setup (One-Time, Before the Loop)

Complete this phase once before starting the experiment loop.

Step 1.1 — Agree on Run Tag

Propose a run tag based on today's date (e.g., mar14). The branch autoresearch/<tag> must NOT already exist — this is a fresh run.

git branch --list "autoresearch/*"

Step 1.2 — Create the Branch

git checkout -b autoresearch/<tag>
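Steps 1.1 and 1.2 can be combined into one check-then-create sketch. This is illustrative only: the throwaway repo and the tag mar14 stand in for the real training repo and today's tag.

```shell
# Sketch: refuse to reuse a tag whose branch already exists.
# A throwaway repo stands in for the real training repo here.
REPO=$(mktemp -d)
cd "$REPO"
git init -q
TAG=mar14   # example tag; use today's date in a real run
if git rev-parse --verify --quiet "refs/heads/autoresearch/${TAG}" >/dev/null; then
  BRANCH_STATE=exists          # stale tag: pick a fresh one instead
else
  git checkout -q -b "autoresearch/${TAG}"
  BRANCH_STATE=created
fi
```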

Step 1.3 — Read In-Scope Files

Read these three files for full context before touching anything:

  • README.md — repository context and goals
  • prepare.py — fixed constants, data prep, tokenizer, dataloader, evaluation. DO NOT MODIFY.
  • train.py — the only file you modify. Architecture, optimizer, hyperparameters, training loop.

Step 1.4 — Verify Data Exists

ls ~/.cache/autoresearch/

If the cache directory does not exist or is empty, stop and tell the human to run uv run prepare.py first.
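The "exists and is non-empty" check can be made explicit. A minimal sketch, using the cache path from the step above:

```shell
# Sketch: report whether a prepared-data cache directory is usable.
data_ready() {
  if [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; then
    echo yes
  else
    echo no    # stop and ask the human to run: uv run prepare.py
  fi
}
data_ready "$HOME/.cache/autoresearch"
```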

Step 1.5 — Environment Sanity

uv sync

Step 1.6 — Initialize results.tsv

Create results.tsv with just the header row. This file stays untracked by git throughout the run.

echo -e "commit\tval_bpb\tmemory_gb\tstatus\tdescription" > results.tsv

Step 1.7 — Establish Baseline

Your very first run MUST be the unmodified baseline. Do not edit train.py yet. Run the experiment as-is (see Phase 2) to establish the baseline metric. Record it in results.tsv.


Phase 2: The Experiment Loop (LOOP FOREVER)

This loop runs indefinitely until the human manually interrupts it. NEVER ask the human if you should continue. NEVER stop for any reason other than: the human interrupts, or a run crashes beyond repair after multiple fix attempts.

WHILE TRUE:
  1. Look at git state (current branch/commit)
  2. Formulate an experimental hypothesis
  3. Edit train.py
  4. git commit
  5. Run the experiment (redirect ALL output to file)
  6. Extract the metric via grep
  7. Evaluate: crash? improve? equal? worse?
  8. Log to results.tsv
  9. Keep (advance branch) or discard (git reset)
  10. Repeat from step 2

Step 2.1 — Check Git State

git log --oneline -5
git status

Step 2.2 — Formulate a Hypothesis

Pick ONE focused idea to test. Examples:

  • "Increase learning rate from 0.01 to 0.03"
  • "Add gradient clipping at norm 1.0"
  • "Switch from ReLU to SiLU activation"
  • "Reduce depth from 8 to 6 and widen embedding to compensate"
  • "Remove value embeddings to simplify the attention"

If you have run out of obvious ideas:

  • Re-read train.py from scratch for angles you missed
  • Re-read prepare.py for constraints you may not have noticed
  • Try combining two near-miss experiments from results.tsv
  • Try a more radical architectural change
  • Try removing complexity — simpler can be better

You will not ask the human for ideas. You generate ideas yourself.

Step 2.3 — Edit train.py

Apply only the changes needed for this single hypothesis. Keep the diff minimal and reviewable.

Constraints (from prepare.py — cannot change):

  • Training time budget: 5 minutes wall clock (excluding startup/compilation)
  • Sequence length, evaluation protocol, tokenizer
  • evaluate_bpb function — this is the ground truth metric

What you CAN change in train.py:

  • Model architecture (depth, width, attention pattern, activations)
  • Optimizer (type, learning rate, scheduler, momentum)
  • Training loop (batch size, accumulation, warmup)
  • Anything else in train.py

VRAM constraint: Large VRAM increases are acceptable only for meaningful metric gains.

Step 2.4 — Git Commit

git add train.py
git commit -m "experiment: <one-line description of what you changed>"

Step 2.5 — Run the Experiment (CONTEXT-SAFE)

Redirect ALL output to a log file. NEVER let training output stream directly into your context. Streaming training logs will flood your context window and crash the session.

uv run train.py > run.log 2>&1

This will run for approximately 5 minutes. If it has not finished after 10 minutes, kill it:

kill %1   # or kill the process by PID

A 10-minute timeout is treated as a crash — discard and revert.
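Killing by hand works, but the ten-minute ceiling can also be enforced automatically. A sketch using coreutils timeout (assumed available), with sleep 2 standing in for a hung training run:

```shell
# Sketch: timeout kills the command and exits 124 when it overruns the limit.
# Real usage would be: timeout 600 uv run train.py > run.log 2>&1
STATUS=0
timeout 1 sleep 2 || STATUS=$?   # sleep 2 stands in for a hung training run
if [ "$STATUS" -eq 124 ]; then
  OUTCOME=crash                  # overran the budget: log as crash, then git reset
else
  OUTCOME="exited $STATUS"
fi
```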

Step 2.6 — Extract the Metric (TARGETED GREP ONLY)

DO NOT cat run.log. DO NOT tail -n 500 run.log.

Extract only the key metrics:

grep "^val_bpb:\|^peak_vram_mb:" run.log

Expected output when successful:

val_bpb:          0.997900
peak_vram_mb:     45060.2
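To get the two numbers into shell variables for logging, a small awk sketch; the heredoc reproduces the expected output above and stands in for a real run.log:

```shell
# Sketch: parse the two metric lines into variables.
# The heredoc stands in for a run.log written by a real training run.
cat > run.log <<'EOF'
val_bpb:          0.997900
peak_vram_mb:     45060.2
EOF
VAL_BPB=$(awk '/^val_bpb:/ {print $2}' run.log)
PEAK_MB=$(awk '/^peak_vram_mb:/ {print $2}' run.log)
```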

Step 2.7 — Evaluate the Result

Case A: Crash (grep returned nothing or training errored)

tail -n 50 run.log

Read the Python stack trace. Decide:

  • Trivial fix (typo, missing import, dimension arithmetic error): Fix it, re-commit, re-run once.
  • Fundamentally broken idea (OOM with huge model, logically impossible change): Do not keep trying. Log as crash and revert.

If you cannot fix a crash after 2 attempts, give up on the idea.
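Detecting Case A in the first place is cheap: an empty grep means no metric was printed. A sketch, with a fabricated traceback standing in for a real crashed run.log:

```shell
# Sketch: classify the run before reading any log tail.
# The fake traceback below stands in for a real crashed run.log.
printf 'Traceback (most recent call last):\n  ValueError: shape mismatch\n' > run.log
if grep -q '^val_bpb:' run.log; then
  CASE=success
else
  CASE=crash    # only now is tail -n 50 run.log worth reading
fi
```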

Case B: Success (val_bpb improved — lower than current baseline)

Keep the commit. The branch now "advances" — this commit becomes the new baseline.

Update your internal baseline value.

Simplicity criterion: Before keeping a win, weigh it:

  • Improvement of ~0.001 val_bpb + added 20 lines of complex code → probably not worth it
  • Improvement of ~0.001 val_bpb from deleting code → definitely keep
  • Improvement of ~0 but much simpler code → keep (simplification win)
  • Large improvement (>0.005 val_bpb) + reasonable complexity → keep

Case C: No improvement (val_bpb equal or worse)

Discard immediately. Do NOT try to "fix" a bad idea.

git reset --hard HEAD~1

This reverts train.py to the previous baseline commit.

Step 2.8 — Log to results.tsv

Record the experiment. Use TAB separators (not commas — commas break descriptions).

COMMIT=$(git rev-parse --short HEAD)
# Fill in values from the grep output and your decision
echo -e "${COMMIT}\t0.997900\t44.0\tkeep\tincrease LR to 0.04" >> results.tsv
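The memory_gb value in that echo comes from the peak_vram_mb line grepped in Step 2.6. A sketch of the conversion (divide by 1024, round to one decimal):

```shell
# Sketch: convert peak_vram_mb (from Step 2.6) to the memory_gb column.
PEAK_MB=45060.2   # example value from the grep output
MEM_GB=$(awk -v mb="$PEAK_MB" 'BEGIN { printf "%.1f", mb / 1024 }')
```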

TSV schema:

  • commit (string): 7-char short hash, e.g. a1b2c3d
  • val_bpb (float): e.g. 0.997900; use 0.000000 for crashes
  • memory_gb (float): peak_vram_mb / 1024, rounded to 1 decimal, e.g. 44.0; use 0.0 for crashes
  • status (enum): keep, discard, or crash
  • description (string): short text, no tabs, e.g. increase LR to 0.04

Example results.tsv:

commit val_bpb memory_gb status description
a1b2c3d 0.997900 44.0 keep baseline
b2c3d4e 0.993200 44.2 keep increase LR to 0.04
c3d4e5f 1.005000 44.0 discard switch to GeLU activation
d4e5f6g 0.000000 0.0 crash double model width (OOM)

IMPORTANT: Do NOT git add results.tsv. Leave it untracked. It tracks all experiments across keeps and discards on this branch.
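Because the file is tab-separated, the current best commit can be recovered in one awk pass. An illustrative sketch; the printf lines rebuild the example table above, which a real run would already have on disk:

```shell
# Sketch: find the best (lowest val_bpb) non-crash commit in results.tsv.
# The printf lines reproduce the example table above for illustration.
printf 'commit\tval_bpb\tmemory_gb\tstatus\tdescription\n'             >  results.tsv
printf 'a1b2c3d\t0.997900\t44.0\tkeep\tbaseline\n'                     >> results.tsv
printf 'b2c3d4e\t0.993200\t44.2\tkeep\tincrease LR to 0.04\n'          >> results.tsv
printf 'c3d4e5f\t1.005000\t44.0\tdiscard\tswitch to GeLU activation\n' >> results.tsv
printf 'd4e5f6g\t0.000000\t0.0\tcrash\tdouble model width (OOM)\n'     >> results.tsv
BEST=$(awk -F'\t' 'NR > 1 && $4 != "crash" && (!seen || $2 < min) \
  { min = $2; best = $1; seen = 1 } END { print best }' results.tsv)
```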


Simplicity Criterion (Decision Framework)

When evaluating whether to keep a change, apply this framework:

  • > 0.005 val_bpb lower, reasonable complexity → keep
  • 0.001–0.005 lower, minimal complexity → keep
  • 0.001–0.005 lower, major complexity (20+ lines, hacky) → discard
  • ≈ 0, simpler code (fewer lines) → keep (simplification win)
  • ≈ 0, equal complexity → discard
  • 0 or worse, any complexity → discard

Goal: the lowest val_bpb in the cleanest code. Complexity is a debt that compounds.


Idea Generation (When Stuck)

If you've exhausted your idea backlog, work through these categories:

  1. Learning rate and schedule — try warmup, cosine decay, different peak LR
  2. Architecture depth vs. width — trade depth for width, or vice versa
  3. Attention patterns — local/global windowed attention, number of KV heads
  4. Optimizer — Muon vs. AdamW vs. hybrid, momentum coefficients
  5. Normalization — RMSNorm placement, pre/post norm
  6. Activation functions — SiLU, GeLU, ReGLU
  7. Batch size and gradient accumulation — total batch size vs. micro-batch size
  8. Simplification — remove features that might be hurting (e.g., value embeddings)
  9. Combination — combine two previously near-miss experiments
  10. Radical changes — double depth, halve width, completely different architecture style

Re-reading train.py and prepare.py from scratch often surfaces new angles.


Iron Laws

  1. NEVER STOP the loop to ask the human for permission — the human is likely asleep. Run indefinitely until interrupted.
  2. ALWAYS redirect training output to a file: uv run train.py > run.log 2>&1. Streaming output floods context and crashes the session.
  3. NEVER cat or tail -n 500 the run log — only use targeted grep "^val_bpb:\|^peak_vram_mb:" run.log for metrics.
  4. NEVER modify prepare.py — it is read-only. The evaluation protocol is fixed.
  5. ALWAYS git reset on discard — revert immediately with git reset --hard HEAD~1. Never try to iterate on a failed idea.
  6. ALWAYS keep results.tsv untracked — never git add results.tsv. It records all experiments across the branch.
  7. NEVER install new packages — only use what's in pyproject.toml.

Anti-Patterns

  • cat run.log or tail -n 500 run.log: floods context with gigabytes of training logs and crashes the session. Instead: grep "^val_bpb:\|^peak_vram_mb:" run.log only.
  • Asking "should I keep going?": the human is likely asleep; this defeats the purpose of autonomous research. Instead: NEVER STOP; continue until manually interrupted.
  • Trying to "fix" a failed hypothesis: most bad ideas are fundamentally wrong, not implementation bugs. Instead: git reset --hard HEAD~1 and move to the next idea.
  • Running multiple hypotheses in one experiment: impossible to attribute wins or losses to specific changes. Instead: one hypothesis per commit, one commit per experiment.
  • Modifying prepare.py: corrupts the evaluation protocol; results become incomparable. Instead: never touch prepare.py; it is fixed.
  • Forgetting to redirect output: training stdout floods agent context mid-experiment. Instead: always uv run train.py > run.log 2>&1.
  • git add results.tsv: clutters git history; results span all experiments, including discards. Instead: never track results.tsv in git.
  • Using commas in results.tsv: commas inside the description field break parsers. Instead: always use TAB separators.
  • Waiting 10+ minutes for a stuck run: OOMs and infinite loops hang silently. Instead: kill any run exceeding 10 minutes and treat it as a crash.
  • Keeping a tiny win with major added complexity: complexity accumulates and future experiments suffer. Instead: apply the simplicity criterion; tiny gain + ugly code = discard.

Memory Protocol (MANDATORY)

Before starting:

node .claude/lib/memory/memory-search.cjs "ml experiment loop autonomous training"

Read .claude/context/memory/learnings.md

After completing a session:

  • Winning patterns discovered → .claude/context/memory/learnings.md
  • Crash causes and workarounds → .claude/context/memory/issues.md
  • Architectural decisions → .claude/context/memory/decisions.md

ASSUME INTERRUPTION: If it's not in memory or results.tsv, it didn't happen.

Related Skills

  • ai-ml-expert — Deep PyTorch and ML domain knowledge for hypothesis generation
  • modern-python — uv/ruff/ty tooling for Python project management
  • git-expert — Advanced git operations for branch and reset workflows