run-experiment

SKILL.md

Run Experiment

Deploy and run ML experiment: $ARGUMENTS

Workflow

Step 1: Detect Environment

Read the project's CLAUDE.md to determine the experiment environment:

  • Local GPU: Look for local CUDA/MPS setup info
  • Remote server: Look for SSH alias, conda env, code directory

If no server info is found in CLAUDE.md, ask the user.

Step 2: Pre-flight Check

Check GPU availability on the target machine:

Remote:

ssh <server> nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader

Local:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
# or for Mac MPS:
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"

Free GPU = memory.used < 500 MiB.

Step 3: Sync Code (Remote Only)

Only sync necessary files — NOT data, checkpoints, or large files:

rsync -avz --include='*.py' --exclude='*' <local_src>/ <server>:<remote_dst>/

Step 4: Deploy

Remote (via SSH + screen)

For each experiment, create a dedicated screen session with GPU binding:

ssh <server> "screen -dmS <exp_name> bash -c '\
  eval \"\$(<conda_path>/conda shell.bash hook)\" && \
  conda activate <env> && \
  CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>'"

Local

# Linux with CUDA
CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>

# Mac with MPS (PyTorch uses MPS automatically)
python <script> <args> 2>&1 | tee <log_file>

For local long-running jobs, use run_in_background: true to keep the conversation responsive.

Step 5: Verify Launch

Remote:

ssh <server> "screen -ls"

Local: Check process is running and GPU is allocated.

Step 6: Feishu Notification (if configured)

After deployment is verified, check ~/.claude/feishu.json:

  • Send experiment_done notification: which experiments launched, which GPUs, estimated time
  • If config absent or mode "off": skip entirely (no-op)

Key Rules

  • ALWAYS check GPU availability first — never blindly assign GPUs
  • Each experiment gets its own screen session + GPU (remote) or background process (local)
  • Use tee to save logs for later inspection
  • Run deployment commands with run_in_background: true to keep conversation responsive
  • Report back: which GPU, which screen/process, what command, estimated time
  • If multiple experiments, launch them in parallel on different GPUs

CLAUDE.md Example

Users should add their server info to their project's CLAUDE.md:

## Remote Server
- SSH: `ssh my-gpu-server`
- GPU: 4x A100 (80GB each)
- Conda: `eval "$(/opt/conda/bin/conda shell.bash hook)" && conda activate research`
- Code dir: `/home/user/experiments/`

## Local Environment
- Mac MPS / Linux CUDA
- Conda env: `ml` (Python 3.10 + PyTorch)
Weekly Installs
8
GitHub Stars
1.3K
First Seen
4 days ago
Installed on
gemini-cli8
github-copilot8
codex8
kimi-cli8
cursor8
amp8