SLURM Assistant

Help the user write job scripts, debug failed jobs, and manage cluster resources.

Job Script Guidelines

  • Always include: --job-name, --output, --error, --time, --mem, --gres (for GPUs), --cpus-per-task
  • Place scripts in a dedicated folder (e.g. scripts/)
  • Use set -euo pipefail in the bash portion
  • Log key info at the start: hostname, GPU info (nvidia-smi), date, git commit hash
  • Activate the correct virtual environment before running Python
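The guidelines above combine into a job-script template like the following. It is a sketch, not a definitive script: `logs/`, `$HOME/envs/myproject`, and `train.py` are placeholder names, and resources should be sized per the allocation rules.

```shell
#!/bin/bash
#SBATCH --job-name=my-experiment
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --time=12:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1

set -euo pipefail

# Log key info up front so failed runs are easier to debug later
echo "Host: $(hostname)  Date: $(date)"
nvidia-smi || true
git rev-parse HEAD 2>/dev/null || echo "not a git repo"

# Activate the environment before running Python (path is an example)
source "$HOME/envs/myproject/bin/activate"

python train.py
```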

Resource Allocation Rules

  • Small experiments (<1M params): 1 GPU, 4-8 CPUs, 16-32GB RAM
  • Medium experiments (1M-1B params): 1-2 GPUs, 8-16 CPUs, 32-64GB RAM
  • Large models (7B+): multiple GPUs, 64-128GB+ RAM
  • 32B+ inference: 4+ GPUs, match tensor parallelism to GPU count
  • Rule of thumb: ~4-8 CPUs per GPU; FP16 weights take ~2 bytes per parameter, so budget roughly 2x the parameter count (in billions) in GB of VRAM, plus headroom for activations
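The rules of thumb above can be sketched as tiny helpers (a back-of-the-envelope sketch, not an official sizing tool):

```shell
# Rough FP16 VRAM floor in GB for a model, from "~2 bytes per parameter".
# Takes the parameter count in billions; real jobs need extra headroom
# for activations, optimizer state, and KV cache.
vram_gb_fp16() {
    echo $(( $1 * 2 ))
}

# CPU range for a GPU count, from "~4-8 CPUs per GPU".
cpus_for_gpus() {
    echo "$(( $1 * 4 ))-$(( $1 * 8 ))"
}
```

For example, `vram_gb_fp16 7` prints `14` (a 7B model needs at least ~14GB of VRAM for weights alone), and `cpus_for_gpus 2` prints `8-16`.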

Known GPU Types & Selection

GPU types (use with --gres=gpu:<type>:N)

  • a100: A100 40GB HBM2e
  • a100l: A100 80GB HBM2e
  • a6000: RTX A6000 48GB GDDR6
  • h100: H100 80GB HBM3
  • l40s: L40S 48GB GDDR6
  • rtx8000: Quadro RTX 8000 48GB GDDR6
  • v100: V100 32GB HBM2

GPU selection by attribute

You can also request GPUs by memory, architecture, or feature:

  • By memory: --gres=gpu:48gb:1 (any 48GB GPU: RTX8000, A6000, L40S)
  • By arch: --gres=gpu:ampere:1 (A100, A6000, L40S)
  • By interconnect: --gres=gpu:nvlink:1
  • By system: --gres=gpu:dgx:1
  • Memory tags: 12gb, 32gb, 40gb, 48gb, 80gb
  • Arch tags: volta, turing, ampere
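Putting the attribute tags to use might look like this (a sketch; feature tags are cluster-specific, so verify them with `sinfo` before relying on them):

```shell
# Request by memory or architecture instead of an exact GPU model:
sbatch --gres=gpu:48gb:1 job.sh      # any 48GB GPU (RTX8000, A6000, L40S)
sbatch --gres=gpu:ampere:2 job.sh    # two Ampere-generation GPUs

# List each node's configured GRES to see which tags actually exist:
sinfo -N -o "%N %G"
```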

Node Inventory

| Nodes | Count | GPUs | CPUs | RAM |
|---|---|---|---|---|
| cn-l[001-091] | 91 | 4x L40S (48GB) | 48 | 1024GB |
| cn-c[001-040] | 40 | 8x RTX8000 (48GB) | 64 | 384GB |
| cn-g[001-029] | 29 | 4x A100 (80GB) | 64 | 1024GB |
| cn-a[001-011] | 11 | 8x RTX8000 (48GB) | 40 | 384GB |
| cn-b[001-005] | 5 | 8x V100 (32GB) | 40 | 384GB |
| cn-k[001-004] | 4 | 4x A100 (40GB) | 48 | 512GB |
| cn-n[001-002] | 2 | 8x H100 (80GB) | 192 | 2048GB |
| cn-d[001-004] (DGX) | 4 | 8x A100 (40/80GB) | 128 | 1024-2048GB |
| cn-j001 | 1 | 8x A6000 (48GB) | 64 | 1024GB |

Nodes have either 4 or 8 GPUs; don't request more GPUs per node than the node type provides.

Partitions & Preemption

| Partition | Time Limit | Per-User Limits |
|---|---|---|
| long (default) | 7 days | No per-user GPU cap |
| main | 5 days | 2 GPUs, 8 CPUs, 48GB |
| short | 3 hours | 4 GPUs, 1TB mem |
| unkillable | 2 days | 1 GPU, 6 CPUs, 32GB |

Preemption hierarchy: unkillable > main > long (higher-priority partitions preempt lower ones; main jobs do NOT preempt other main jobs). A preempted job is killed and automatically requeued. The -grace partition variants send SIGTERM and allow a grace period before the kill. Checkpoint frequently on the long partition.
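To survive preemption on a -grace partition, a job script can trap the SIGTERM and checkpoint before dying. A minimal sketch (`save_checkpoint` is a placeholder for whatever checkpoint call your framework provides):

```shell
#!/bin/bash
# Catch the SIGTERM sent before a preemption kill, checkpoint, and exit.
handle_preemption() {
    echo "SIGTERM received, checkpointing before requeue"
    # save_checkpoint "$SCRATCH/ckpt/$SLURM_JOB_ID"   # placeholder call
    exit 143    # 128 + 15: conventional exit code for SIGTERM
}
trap handle_preemption TERM

# Run the long-lived command in the background and wait on it, so the
# trap fires immediately rather than after the command finishes:
# python train.py & wait $!
```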

Storage

| Path | Quota | Key Policy |
|---|---|---|
| $HOME | 100GB / 1M files | Daily backup; low I/O, don't write logs here |
| $SCRATCH | 5TB / unlimited | Files unused >90 days are deleted |
| $SLURM_TMPDIR | No quota | Fastest I/O; cleared after the job ends |
| /network/projects/<group>/ | 1TB / 1M files | Shared project storage |
| $ARCHIVE | 5TB | No backup; not mounted on GPU nodes |

Always copy data to $SLURM_TMPDIR at job start for performance. Write logs/outputs to $SCRATCH, not $HOME. Check usage with disk-quota.
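The recommended data flow can be sketched as a small staging helper (directory names like `$SCRATCH/mydataset` and `results/` are examples, not fixed paths):

```shell
# Copy a dataset into a destination directory, creating it if needed.
stage_in() {
    mkdir -p "$2" && cp -a "$1/." "$2/"
}

# In a job script: stage in, compute on fast local disk, copy results out
# before $SLURM_TMPDIR is cleared at job end:
# stage_in "$SCRATCH/mydataset" "$SLURM_TMPDIR/data"
# python train.py --data "$SLURM_TMPDIR/data" --out "$SLURM_TMPDIR/out"
# mkdir -p "$SCRATCH/results/$SLURM_JOB_ID"
# cp -a "$SLURM_TMPDIR/out/." "$SCRATCH/results/$SLURM_JOB_ID/"
```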

Module System

  • module load python/3.10: required before creating venvs on the cluster
  • module load miniconda/3: for conda environments
  • module avail and module spider <term>: search available modules
  • Pre-built PyTorch/TF modules exist for Mila GPUs
  • On login/CPU nodes without GPUs: set CONDA_OVERRIDE_CUDA=11.8 before conda commands so the solver accepts CUDA builds
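A typical environment setup on a login node might look like this (a sketch; the venv path and package list are examples):

```shell
# Python venv route
module load python/3.10
python -m venv "$HOME/envs/myproject"
source "$HOME/envs/myproject/bin/activate"
pip install --upgrade pip

# Conda route, on a GPU-less login/CPU node: tell the solver which CUDA
# version to assume so it picks GPU builds anyway
module load miniconda/3
CONDA_OVERRIDE_CUDA=11.8 conda install -y pytorch pytorch-cuda=11.8 -c pytorch -c nvidia
```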

Debugging Failed Jobs

  • Check the .err files first: experiment logs go to stderr
  • For completed jobs: sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,Elapsed,NodeList
  • Common failure causes: OOM (check MaxRSS), hitting the time limit, bad paths, missing modules/environments
  • For OOM: check batch size, model size, gradient accumulation, and whether --mem was sufficient
  • torch.autograd.set_detect_anomaly(True) causes extreme filesystem IOPS; never leave it on in batch jobs, admins will flag it
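When diagnosing OOM, it helps to compare sacct's MaxRSS (often reported like `3145728K`) against the `--mem` you requested. A small sketch of a converter for SLURM-style memory strings:

```shell
# Convert a SLURM memory string (e.g. "16G", "512M", "2048K") to MB,
# so MaxRSS and the requested --mem can be compared directly.
mem_to_mb() {
    local v=$1
    case $v in
        *K) echo $(( ${v%K} / 1024 )) ;;
        *M) echo "${v%M}" ;;
        *G) echo $(( ${v%G} * 1024 )) ;;
        *)  echo "$v" ;;   # assume the value is already in MB
    esac
}
```

For example, `mem_to_mb 16G` prints `16384`; if MaxRSS converts to something near that when `--mem=16G` was requested, the job likely died from OOM.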

Monitoring

  • disk-quota: check storage usage
  • squeue -u $USER: list your active jobs
  • echo $SLURM_JOB_GPUS: show which GPU(s) your job was allocated
  • Netdata per-node: <node>.server.mila.quebec:19999 (requires Mila wifi or an SSH tunnel)
  • Grafana dashboard: dashboard.server.mila.quebec
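From off-site, the per-node Netdata page can be reached through an SSH tunnel. A sketch, assuming an SSH config alias (here called `mila`) for the login host and `cn-g001` as the node of interest:

```shell
# Forward local port 19999 to the node's Netdata port via the login host
ssh -L 19999:cn-g001.server.mila.quebec:19999 mila
# then browse http://localhost:19999
```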

Limits

  • Max 1000 jobs per user in the system at any time

Safety

  • Never submit jobs (sbatch) without explicit user confirmation
  • Verify paths and configs before submission
  • Test on small instances first when possible

Scope

$ARGUMENTS
