Slurm Cluster Management
Help developers submit, manage, and troubleshoot GPU-accelerated workloads on SRP's Slurm clusters. Supports training, inference, and data processing jobs using Apptainer containers.
When to Use This Skill
Use this skill when:
- Submitting GPU training or inference jobs to Slurm clusters
- Managing running or queued jobs
- Monitoring cluster resources and job status
- Debugging job failures or performance issues
- Writing Slurm job scripts with Apptainer containers
- Checking GPU availability and utilization
SRP Slurm Clusters
Oracle OKE Cluster (H100 GPUs)
SSH Access:
ssh -p 2222 <your-ldap-username>@129.80.180.16
# Example:
ssh -p 2222 zhuguangbin@129.80.180.16
GPU Type: H100
Partition: h100 (must specify in job scripts)
Use Cases: Large model training, high-performance inference
DO DOKS Cluster (H200 GPUs)
SSH Access:
ssh -p 2222 <your-ldap-username>@129.212.240.50
# Example:
ssh -p 2222 zhuguangbin@129.212.240.50
GPU Type: H200
Partition: Specify in job scripts
Use Cases: Latest GPU workloads, large-scale training
Data Access
Both clusters use JuiceFS for unified data access:
- Path: /data0/or/data/srp/
- Same permissions and directory structure as development machines
- Shared across all cluster nodes and with A10 dev machines
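To confirm the shared path is reachable from a compute node, a quick srun check works. This is a minimal sketch, assuming /data0 is visible inside job containers the same way it is in the templates below; if it is not, add --bind /data0:/data0 to the apptainer command:
# One-off check that the JuiceFS path is visible from a compute-node container
srun --partition=h100 apptainer exec /data0/apptainer/pytorch_24.01-py3.sif \
    ls -ld /data0/or/data/srp/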
Monitoring
Oracle OKE Cluster Dashboards:
- Cluster Overview: https://grafana.g.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- Workload Monitoring: https://grafana.g.yesy.site/d/f2c83374-71e2-42c6-92a1-10505b584cf2/workload
- Job-Level Stats: https://grafana.g.yesy.site/d/HRLkiLS7k/slurmjobstats
DO DOKS Cluster Dashboards:
- Cluster Overview: https://grafana.g2.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- Workload Monitoring: https://grafana.g2.yesy.site/d/workload/workload
- Job-Level Stats: https://grafana.g2.yesy.site/d/slurm/slurm
Metrics Available:
- Cluster resource utilization
- GPU availability and usage
- Job queue status
- Per-job resource consumption
- Historical workload patterns
Essential Slurm Commands
Job Submission
# Submit batch job script
sbatch job_script.sh
# Submit with ssubmit wrapper (recommended)
ssubmit -j job_name -p h100 -g 1 -c 10 -m 32G -t 2:00:00 -cmd "python train.py"
# Interactive job allocation
salloc --partition=h100 --gres=gpu:1 --time=01:00:00
# Run command directly
srun --partition=h100 --gres=gpu:1 python test.py
Job Management
# View your jobs
squeue -u $USER
# View all jobs
squeue
# View specific job details
scontrol show job <job_id>
# Cancel job
scancel <job_id>
# Cancel all your jobs
scancel -u $USER
# Cancel jobs by name
scancel --name=job_name
Cluster Information
# View partitions and nodes
sinfo
# View detailed node info
sinfo -N -l
# Check GPU availability
sinfo -o "%20N %10c %10m %25f %10G"
# View specific partition
sinfo -p h100
Job History
# View completed jobs
sacct
# View specific job details
sacct -j <job_id> --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode
# View jobs from last week
sacct --starttime=now-7days --format=JobID,JobName,Elapsed,State,ExitCode
Job Script Structure
Modern Slurm Script (Simplified)
The new Slinky Slurm clusters use prolog/epilog for notifications, so scripts are much simpler:
#!/bin/bash
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --job-name=my-training-job
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=32GB
#SBATCH --time=02:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=slurm-notification@srp.one
set -x
#==============================
# Environment Setup
#==============================
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(shuf -i 1000-65535 -n 1)
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
# Set your tokens (replace with actual values)
export HF_TOKEN=your_huggingface_token_here
export WANDB_API_KEY=your_wandb_api_key_here
export WANDB_PROJECT=${SLURM_JOB_NAME}
export WANDB_NAME=${SLURM_JOB_NAME}-$(date +%Y%m%d%H%M%S)
#==============================
# Pre-task initialization
#==============================
echo "Running pre-task initialization..."
# Your setup commands here
#==============================
# Main Job Execution
#==============================
echo "Starting main task..."
srun -v -l --jobid $SLURM_JOBID --job-name=${SLURM_JOB_NAME} \
--output $SLURM_SUBMIT_DIR/logs/%x_%j_%s_%t_%N.out \
--error $SLURM_SUBMIT_DIR/logs/%x_%j_%s_%t_%N.err \
apptainer run --fakeroot --writable-tmpfs --nv \
/data0/apptainer/pytorch_24.01-py3.sif bash -ex << 'EOF'
# ==== YOUR JOB COMMANDS START ====
echo "Training started at $(date)"
python train.py \
--model gpt2 \
--batch-size 32 \
--epochs 10 \
--output-dir /data0/models/
nvidia-smi
echo "Training completed at $(date)"
# ==== YOUR JOB COMMANDS END ====
EOF
Key SBATCH Parameters
| Parameter | Description | Example |
|---|---|---|
| `--job-name` | Job name (shows in squeue) | `my-training` |
| `--partition` | Cluster partition | `h100` |
| `--gres` | GPU resources | `gpu:H100:1` (1 GPU), `gpu:H100:2` (2 GPUs) |
| `--nodes` | Number of nodes | `1` (single node), `2` (distributed) |
| `--cpus-per-task` | CPUs per task | `10` |
| `--mem` | Memory per node | `32GB` |
| `--time` | Max runtime | `02:00:00` (2 hours) |
| `--output` | stdout log file | `logs/%x_%j.out` |
| `--error` | stderr log file | `logs/%x_%j.err` |
| `--mail-type` | Email notification | `ALL`, `FAIL`, `END` |
Log File Placeholders:
- %x - Job name
- %j - Job ID
- %s - Step ID
- %t - Task ID
- %N - Node name
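As a concrete illustration (hypothetical job ID, for the example only), the directives from the template above expand as follows:
#SBATCH --output=logs/%x_%j.out   # job "my-training-job" with ID 12345 -> logs/my-training-job_12345.out
#SBATCH --error=logs/%x_%j.err    # -> logs/my-training-job_12345.err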
Multi-Node Distributed Training
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --partition=h100
#SBATCH --nodes=2
#SBATCH --gres=gpu:H100:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH --mem=64GB
#SBATCH --time=04:00:00
set -x
# Distributed training setup
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=12345
export WORLD_SIZE=$((SLURM_NNODES * SLURM_NTASKS_PER_NODE))
srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
python -m torch.distributed.launch \
--nproc_per_node=$SLURM_NTASKS_PER_NODE \
--nnodes=$SLURM_NNODES \
--node_rank=$SLURM_NODEID \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
train_distributed.py
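Note that torch.distributed.launch is deprecated in recent PyTorch releases. A minimal sketch of the equivalent launch with torchrun, assuming the container ships a PyTorch version that provides it:
srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
    torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=$SLURM_NTASKS_PER_NODE \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train_distributed.py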
Using Apptainer Containers
Available Container Images
Location: /data0/apptainer/
Common Images:
- pytorch_24.01-py3.sif - PyTorch 24.01 with Python 3
- ray_2.52.0-py310-gpu.sif - Ray 2.52.0 with Python 3.10
- Custom images built for specific projects
Apptainer Command Patterns
# Run container with GPU support
apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif python script.py
# Shell into container
apptainer shell --nv /data0/apptainer/pytorch_24.01-py3.sif
# Execute single command
apptainer exec --nv /data0/apptainer/pytorch_24.01-py3.sif nvidia-smi
# With additional flags
apptainer run --fakeroot --writable-tmpfs --nv <image.sif> <command>
Common Flags:
- --nv - Enable NVIDIA GPU support
- --fakeroot - Fake root user privileges (for installing packages)
- --writable-tmpfs - Create a writable temporary filesystem
- --bind <src>:<dst> - Mount additional directories (see the example below)
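A minimal sketch of --bind, using a hypothetical dataset directory and script name for illustration; it mounts a host path into the container at /mnt/dataset:
# my-dataset and preprocess.py are hypothetical placeholders
apptainer exec --nv \
    --bind /data0/or/data/srp/my-dataset:/mnt/dataset \
    /data0/apptainer/pytorch_24.01-py3.sif \
    python preprocess.py --input /mnt/dataset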
Interactive Container Session
# Start interactive job with Apptainer
sapptainer -c 20 -m 200G -g 1 -p h100 -i /data0/apptainer/pytorch_24.01-py3.sif
# Parameters:
# -c: CPUs
# -m: Memory
# -g: GPUs
# -p: Partition
# -i: Container image
Using ssubmit Wrapper
SRP provides the ssubmit wrapper for simplified job submission:
# Basic usage
ssubmit -j job_name -p h100 -g 1 -c 10 -m 32G -t 2:00:00 \
-cmd "python train.py"
# With custom script
ssubmit -j my-job -p h100 -g 2 -s job_script.sh
# Interactive mode
ssubmit -j interactive -p h100 -g 1 -i
Parameters:
- -j - Job name
- -p - Partition (h100, compute)
- -g - Number of GPUs
- -c - Number of CPUs
- -m - Memory (e.g., 32G)
- -t - Time limit (HH:MM:SS)
- -cmd - Command to run
- -s - Script file to execute
- -i - Interactive mode
Reference: https://github.com/SerendipityOneInc/llm-jobs/blob/main/slurm/ssubmit-examples/README.md
Feishu Notifications
Slurm clusters automatically send Feishu notifications for job events via prolog/epilog:
Notification Types:
- ✅ Job started
- ✅ Job completed successfully
- ❌ Job failed with error code
- ⏱️ Job timeout
- 🛑 Job cancelled
Notification Channel: slurm-notification@srp.one
What's Included:
- Job ID, name, partition
- Node allocation
- Start and end time
- Exit status
- Resource usage summary
- Log file locations
No Action Needed: Notifications are automatic - no need to add notification code to your scripts.
Best Practices
Resource Allocation
- Request What You Need:
  - Don't over-request CPUs/memory - it delays scheduling
  - Start with minimal resources, scale up if needed
- GPU Utilization:
  - Use nvidia-smi to verify the GPU is being used (see the monitoring sketch after this list)
  - Monitor GPU memory with nvidia-smi dmon
- Time Limits:
  - Set realistic time limits (slightly above the expected runtime)
  - Jobs exceeding their time limit are killed
- Partitions:
  - Always specify the partition explicitly
  - Use h100 for Oracle; use the appropriate partition for DO
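A minimal in-job sketch for the GPU-utilization check: sample utilization to a CSV in the background around your training command, so you can confirm afterwards that the job actually used its GPU(s). It assumes a logs/ directory exists in the submission directory:
# Sample GPU utilization every 60 s in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
    --format=csv -l 60 > "logs/gpu_util_${SLURM_JOB_ID}.csv" &
GPU_MON_PID=$!
# ... your training command here ...
kill "$GPU_MON_PID"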
Job Organization
# Organize logs by job name
#SBATCH --output=logs/%x/%j.out
#SBATCH --error=logs/%x/%j.err
# Note: Slurm filename patterns (%x, %j, ...) have no date specifiers such as %Y%m%d,
# and Slurm does not create missing log directories. To organize logs by date, create
# the dated directory and pass --output/--error at submission time (see the sketch below).
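A minimal wrapper sketch for date-organized logs, assuming it is run from the job's submission directory; options passed on the sbatch command line override the #SBATCH directives inside the script:
#!/bin/bash
# Create today's log directory, then submit with date-based output paths
LOG_DIR="logs/$(date +%Y%m%d)"
mkdir -p "$LOG_DIR"
sbatch --output="${LOG_DIR}/%x_%j.out" --error="${LOG_DIR}/%x_%j.err" job_script.sh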
Checkpoint and Resume
# Save checkpoints periodically
import torch
import os
checkpoint_dir = "/data0/checkpoints"
checkpoint_path = os.path.join(checkpoint_dir, f"model_epoch_{epoch}.pt")
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
}, checkpoint_path)
# Resume from checkpoint
if os.path.exists(checkpoint_path):
checkpoint = torch.load(checkpoint_path)
model.load_state_dict(checkpoint['model_state_dict'])
start_epoch = checkpoint['epoch'] + 1
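For runs longer than a single time limit, one common pattern is to chain jobs with a dependency so the next allocation resumes from the latest checkpoint. A minimal sketch, assuming the training script resumes from its checkpoint directory on startup as above:
# Queue a follow-up run that starts only after the current job ends (any exit state),
# so training continues from the saved checkpoint across time limits
sbatch --dependency=afterany:<job_id> job_script.sh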
Error Handling
# Set bash options for safety
set -e # Exit on error
set -u # Error on undefined variable
set -x # Print commands (useful for debugging)
set -o pipefail # Exit on pipe failure
# Add error traps
trap 'echo "Error on line $LINENO"; exit 1' ERR
Monitoring and Debugging
Check Job Status
# Detailed job info
scontrol show job <job_id>
# Watch job queue
watch -n 5 squeue -u $USER
# Check why job is pending
squeue -j <job_id> --start
View Logs
# Tail logs while job runs
tail -f logs/job_name_12345.out
# View last 100 lines
tail -n 100 logs/job_name_12345.out
# Search for errors
grep -i error logs/job_name_12345.err
GPU Monitoring
# Inside running job container
nvidia-smi
# Continuous monitoring
nvidia-smi dmon
# Detailed GPU utilization
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.free --format=csv -l 5
Resource Usage
# Check job efficiency
seff <job_id>
# Detailed accounting
sacct -j <job_id> --format=JobID,JobName,Elapsed,CPUTime,MaxRSS,State
Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Job pending forever | No available resources | Check sinfo for available GPUs; adjust resource requests |
| "Out of memory" error | Insufficient memory request | Increase --mem in job script |
| GPU not detected | Missing --gres or --nv | Add --gres=gpu:X to sbatch, --nv to apptainer |
| Container not found | Wrong image path | Verify path in /data0/apptainer/ |
| Permission denied | File permissions issue | Check file ownership and permissions |
| Module not found | Missing Python packages | Install in container or use different image |
| NCCL timeout | Network issues in distributed training | Check NCCL env vars, verify nodes can communicate |
| Killed job (OOM) | Memory exceeded | Reduce batch size or increase --mem |
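For the most common case, a job stuck in the pending state, a quick triage sketch that prints the scheduler's reason for the job and the state and GPU resources of nodes in the target partition:
# Why is the job still pending? (%R prints the scheduler's reason)
squeue -j <job_id> -o "%.10i %.9P %.20j %.8T %R"
# Node state and GPU resources in the partition
sinfo -p h100 -o "%20N %10T %10G"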
Quick Reference
Essential Commands
# Submit job
sbatch job.sh
# Check queue
squeue -u $USER
# Job details
scontrol show job <job_id>
# Cancel job
scancel <job_id>
# View logs
tail -f logs/job_*.out
# Cluster info
sinfo -p h100
# Job history
sacct --starttime=today
Example Workflows
1. Quick GPU Test
# Submit test job
sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=h100
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --output=test_%j.out
srun apptainer exec --nv /data0/apptainer/pytorch_24.01-py3.sif \
nvidia-smi
EOF
2. Training with Checkpoints
#!/bin/bash
#SBATCH --job-name=training-with-checkpoint
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@60
checkpoint_handler() {
    echo "Received signal, saving checkpoint..."
    # Signal the Python process to save a checkpoint
    pkill -USR1 -f train.py
}
trap checkpoint_handler USR1
# Run the workload in the background and wait, so the trap can fire while srun is active
# (bash only runs a trap handler after a foreground command returns)
srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
    python train.py \
    --checkpoint-dir /data0/checkpoints \
    --resume-if-exists &
wait
3. Batch Processing
#!/bin/bash
#SBATCH --job-name=batch-inference
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --array=0-9
#SBATCH --time=01:00:00
# Process 10 shards in parallel
SHARD_ID=$SLURM_ARRAY_TASK_ID
srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
python inference.py \
--input /data0/input/shard_${SHARD_ID}.json \
--output /data0/output/shard_${SHARD_ID}.json
Resources
Official Documentation
- Slurm Commands: https://slurm.schedmd.com/man_index.html
- Slurm Quick Start: https://slurm.schedmd.com/quickstart.html
- Apptainer User Guide: https://apptainer.org/docs/user/latest/
SRP Resources
- Deployment Guide: https://starquest.feishu.cn/wiki/TZASwm86nivXLTkMV6kcoJF4n2I
- Oracle OKE Grafana: https://grafana.g.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- DO DOKS Grafana: https://grafana.g2.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- ssubmit Examples: https://github.com/SerendipityOneInc/llm-jobs/blob/main/slurm/ssubmit-examples/README.md
Implementation Steps
When helping users with Slurm jobs:
1. Understand Requirements:
   - What workload type? (training, inference, data processing)
   - GPU requirements (quantity, memory)
   - Expected runtime
   - Data input/output locations
2. Choose Cluster:
   - Oracle OKE (H100) for most workloads
   - DO DOKS (H200) for cutting-edge GPU needs
3. Write Job Script:
   - Use the modern simplified template (no notification code)
   - Specify appropriate resources
   - Use an Apptainer container with the --nv flag
   - Set up proper logging
4. Submit and Monitor:
   - Submit with sbatch or ssubmit
   - Monitor with squeue and Grafana
   - Check logs for errors
   - Verify GPU utilization
5. Debug Issues:
   - Check Feishu notifications for failure reasons
   - Review log files
   - Use scontrol for detailed job info
   - Consult the troubleshooting table
6. Optimize:
   - Adjust batch sizes based on GPU memory
   - Use job arrays for parallel processing
   - Implement checkpointing for long runs
   - Monitor resource usage with sacct and seff