run-experiment
Run Experiment
Submit an ML experiment to a compute environment — local machine, SLURM HPC (Ibex, UW, etc.), or RunAI/Kubernetes (EPFL).
Generates a reproducible job script in jobs/ that is committed alongside the code, then provides the exact submit command to run.
Pair this skill with research-project-memory when a launched job should be linked to a planned experiment, evidence item, worktree, or project action.
Pair this skill with project toolchain gates when generated job scripts should be checked before launch. Use shellcheck and shfmt when available, but do not require the user to install them just to generate a draft script.
Terminology:
local: the machine where the agent is running, usually the user's MacGit remote: GitHub/GitLab remote used for code sync, such asoriginserver: SSH/HPC/RunAI execution environment such asquest,ibex-vscode, orepfl-haas
When launching on SLURM or RunAI, call it a server run rather than a remote run unless referring to the Git remote.
Skill Directory Layout
<installed-skill-dir>/
├── SKILL.md
├── environments.yaml # Environment profiles (extend for new clusters)
└── templates/
├── slurm_job.sh # SLURM template (Ibex, UW, any SLURM cluster)
├── runai_job.sh # RunAI/Kubernetes template (EPFL)
└── local_run.sh # Local tmux/nohup template
Steps to Follow
1. Read the environment registry
Resolve <installed-skill-dir> as the directory containing this SKILL.md, then read <installed-skill-dir>/environments.yaml.
List the available environments to the user with a one-line description each.
2. Ask for experiment details
Ask the user in a single message:
- Environment: Which compute env? (show available choices from environments.yaml + "other")
- Script / command: What command to run? (e.g.,
uv run python train.py --lr 1e-3) - Job name: Short identifier (e.g.,
baseline-cifar10,ablation-no-attn). Default: script basename + date. - GPU count: How many GPUs? (default from env profile, 0 for CPU-only)
- Walltime / time limit: (SLURM only) How long? (default from env profile)
- Conda env or venv: Name of the conda environment, or
.venvpath (if applicable) - Output directory: Where to save checkpoints/results? (default:
outputs/<job-name>/) - Anything special?: Extra env vars, array job, specific GPU type, PVC mounts (RunAI), etc.
If --env, --script, --name, or --gpus were passed as arguments, pre-fill those answers.
3. Locate the project root
git rev-parse --show-toplevel 2>/dev/null || pwd
Also capture the short git commit hash:
git rev-parse --short HEAD 2>/dev/null || echo "no-git"
4. Generate the job script
Based on the environment type:
type: slurm (ibex, uw, or any SLURM cluster)
Read the SLURM template from <installed-skill-dir>/templates/slurm_job.sh.
Fill in all {PLACEHOLDER} variables:
| Placeholder | Value |
|---|---|
{PROJECT} |
project directory name |
{ENV_NAME} |
environment key (e.g., ibex) |
{ENV_DISPLAY} |
display name from profile |
{DATE} |
today's date YYYY-MM-DD |
{COMMIT} |
short git SHA |
{JOB_NAME} |
user-provided job name |
{SCRIPT_NAME} |
filename of the generated script |
{PARTITION} |
from env profile defaults (or user override) |
{CPUS} |
cpus_per_task from profile (or user override) |
{GPUS} |
user-provided GPU count |
{MEM} |
from profile defaults (or user override) |
{WALLTIME} |
user-provided or profile default |
{LOG_DIR} |
outputs/logs/<job-name> |
{OUTPUT_DIR} |
outputs/<job-name> |
{PROJECT_ROOT} |
absolute project root path |
{CONDA_ENV} |
user-provided env name |
{RUN_COMMAND} |
user-provided command |
{SCRATCH} |
scratch path from env profile |
Uncomment the relevant module load lines based on the env profile's common_modules.
Uncomment the conda activate or source .venv/activate line based on user's answer.
If scratch path is in the env profile, uncomment the TMPDIR block.
Output path: jobs/<job-name>.sh
type: runai (runai profile)
Read the RunAI template from <installed-skill-dir>/templates/runai_job.sh.
Fill in placeholders:
| Placeholder | Value |
|---|---|
{PROJECT} |
project directory name |
{DATE} |
today's date |
{COMMIT} |
short git SHA |
{JOB_NAME} |
user-provided job name |
{SCRIPT_NAME} |
filename of generated script |
{IMAGE} |
from env profile default_image (ask user to confirm) |
{RUNAI_PROJECT} |
from env profile project |
{GPUS} |
GPU count |
{CPUS} |
CPU count from profile defaults |
{MEM} |
memory from profile defaults |
{PVC_FLAGS} |
generated from pvc_mounts in profile: --pvc claim:path \ per mount |
{RUN_COMMAND} |
user-provided command |
Output path: jobs/<job-name>-runai.sh
type: local
Read the local template from <installed-skill-dir>/templates/local_run.sh.
Fill in placeholders similarly. Uncomment conda/venv activation as appropriate.
Output path: jobs/<job-name>-local.sh
type: other / unknown
If the user specifies an environment not in environments.yaml:
- Ask: "What scheduler does it use? (slurm / runai / other)"
- If SLURM-compatible: use the SLURM template with the info the user provides.
- If truly novel: generate a minimal generic wrapper and explain what to fill in.
- Suggest: "Want me to add this environment to
environments.yamlfor future use?"
5. Write the job script and preview
Create the job script directory, log directory, and output directory before previewing or submitting:
mkdir -p <project-root>/jobs
mkdir -p <project-root>/outputs/logs/<job-name>
mkdir -p <output-dir>
Write the filled-in script to jobs/<job-name>.sh (or -runai.sh / -local.sh).
Show the user the full generated script for review.
5.1 Check generated shell script gates
After writing the job script and before offering to launch it, run non-mutating shell gates when the tools are available:
command -v shellcheck >/dev/null && shellcheck jobs/<job-name>.sh || true
command -v shfmt >/dev/null && shfmt -d jobs/<job-name>.sh || true
bash -n jobs/<job-name>.sh
Report whether each gate passed, failed, or was skipped because the tool is not installed. Treat bash -n failures as blockers before launch. Treat shellcheck and shfmt -d failures as warnings unless the project policy marks them required.
Do not run shfmt -w silently. If formatting is requested or required by policy, run shfmt -w jobs/<job-name>.sh and review the diff before submitting.
6. Show the submit command and ask to launch
Print the exact command(s) to submit, tailored to the environment:
SLURM (Ibex / UW / etc.)
# If you're already on the login node:
sbatch jobs/<job-name>.sh
# If submitting from your local machine to a server (requires ssh access):
scp jobs/<job-name>.sh <ssh-alias>:<project-root>/jobs/
ssh <ssh-alias> "cd <project-root> && mkdir -p outputs/logs/<job-name> <output-dir> jobs && sbatch jobs/<job-name>.sh"
# Monitor:
squeue -u $USER
sacct -j <jobid> --format=JobID,State,Elapsed,AllocGRES
tail -f outputs/logs/<job-name>/slurm-<jobid>.out
RunAI
bash jobs/<job-name>-runai.sh
# Monitor:
runai list
runai logs <job-name> -f
Local
# Attached (output in terminal):
bash jobs/<job-name>-local.sh
# Detached in tmux:
tmux new-session -d -s <job-name> "bash jobs/<job-name>-local.sh"
tmux attach -t <job-name>
# Background with nohup:
nohup bash jobs/<job-name>-local.sh &
Ask: "Want me to run the submit command now?"
- If yes and local: run it directly.
- If yes and server: run the
scp+ssh sbatchcommand (requires ssh key auth to be set up). - If no: remind the user that the script is saved in
jobs/and ready to submit.
7. Offer to add to jobs index (optional)
If a jobs/README.md or jobs/index.md exists, offer to append a one-line entry:
| {DATE} | {JOB_NAME} | {ENV_NAME} | {COMMIT} | {RUN_COMMAND_BRIEF} |
If the repo follows the code evidence layout from init-python-project, also offer to create or update a short run pointer under:
docs/runs/<DATE>-<job-name>.md
This file should contain the command, config, commit, output path, expected metric, and monitor command. It should not contain raw logs.
8. Update project memory when present
If the repo has memory/ or a worktree .agent/worktree-status.md, update only verified run pointers:
memory/evidence-board.md: add or update the linkedEXP-###with job script path, commit, command, output directory, and statusplanned,submitted, orrunningonly if verifiedmemory/provenance-board.md: add only planned or available provenance for run pointers; do not mark final metrics as verified until outputs are checkeddocs/runs/: write a small run pointer when the code repo uses that conventionmemory/action-board.md: mark the launch action asdoingor create a monitor actionmemory/handoff-board.md: create a monitor/fetch/report handoff only when another module is expected to consume the run outputmemory/current-status.md: record the latest known job and what must be checked next<worktree>/.agent/worktree-status.md: link the run to the worktree purpose and exit condition
Do not store queue state, job success, or final metric values as durable facts unless they were verified in this session. Use needs-verification for monitor tasks.
Environment Reference
All environments are defined in environments.yaml. The current known environments:
| Key | Type | Cluster | Notes |
|---|---|---|---|
local |
local | — | Current machine, tmux/nohup |
ibex |
slurm | KAUST Ibex | ilogin.ibex.kaust.edu.sa; gpu/batch/himem partitions |
uw |
slurm | UW HPC | Placeholder — update environments.yaml with actual details |
runai |
runai | EPFL RunAI | Kubernetes; update project/image in environments.yaml |
Adding a New Environment
Edit <installed-skill-dir>/environments.yaml and add a block:
my-cluster:
type: slurm # or runai / local
display_name: "My University HPC"
login_node: "login.cluster.edu"
ssh_alias: mycluster
scheduler: slurm
partitions:
gpu:
name: gpu
flag: "--partition=gpu"
gpu_flag: "--gres=gpu:{count}"
max_gpus_per_job: 4
defaults:
partition: gpu
gpus: 1
cpus_per_task: 4
mem: "32G"
walltime: "12:00:00"
max_walltime: "48:00:00"
storage:
home: "/home/{user}"
scratch: "/scratch/{user}"
module_system: lmod
common_modules:
- "cuda/12.1"
- "python/3.11"
notes: "..."
Reproducibility Conventions
Every generated job script includes:
- Git commit hash in the header and as an env var (
GIT_COMMIT) - Structured output directory:
outputs/<job-name>/for checkpoints,outputs/logs/<job-name>/for logs - Timestamped log files so reruns don't overwrite
- Exit code propagation so job arrays and downstream scripts detect failures
The jobs/ directory should be committed to git (the scripts are small text files). Actual outputs go to outputs/ which is typically .gitignored.
Example Invocations
/run-experiment # interactive wizard
/run-experiment --env ibex --script train.py --gpus 2
/run-experiment --env local --script eval.py --name eval-baseline
/run-experiment --env runai --gpus 4 --name big-run
/run-experiment --env ibex --script sweep.py --name sweep --gpus 1
Common Patterns
Job Array (SLURM) — hyperparameter sweep
When the user says "I want to sweep over N configs":
- Ask for the sweep configs or config file (e.g.,
configs/sweep.yamlwith N entries). - Add
#SBATCH --array=0-{N-1}%{max_concurrent}to the script. - Add to the run command:
--config configs/sweep.yaml --config-idx $SLURM_ARRAY_TASK_ID - Output dir:
outputs/<job-name>/$SLURM_ARRAY_TASK_ID/
Multi-GPU (DDP)
When GPUs > 1 and the env is SLURM:
- Add
--ntasks-per-node={GPUS}directive - Wrap command with
torchrun --nproc_per_node={GPUS}orsrun python -m torch.distributed.launch - Ask the user which distributed launcher they use
Interactive Session (Debugging)
When the user wants to debug interactively (not submit a batch job):
Ibex:
srun --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=2:00:00 --pty bash
RunAI:
runai submit <name> --image <image> --gpu 1 --interactive --stdin -- bash
runai bash <name>
Generate this command directly without creating a script file.
More from a-green-hand-jack/ml-research-skills
project-init
Initialize an ML research project control root. Use for paper/code/slides repos, shared memory, GitHub Project alignment, agent guidance, worktree policy, and lifecycle handoffs.
37project-sync
Sync verified code-side experiment results into paper memory. Use when logs, reports, run docs, or user-confirmed metrics should become paper-facing evidence.
36add-git-tag
Create annotated Git milestone tags. Use when completing a phase, releasing a version, or marking a research checkpoint.
36update-docs
Refresh project documentation after code changes. Use after implementing features, changing behavior, or preparing a milestone commit.
36init-latex-project
Initialize a LaTeX academic paper project. Use for new conference or journal papers needing templates, macros, venue preambles, and writing guidance.
36new-workspace
Create Git branches or worktrees for research code and paper versions. Use for experiments, baselines, rebuttal fixes, arXiv/camera-ready branches, and worktree memory.
36