# GPU Server Management
Provision, configure, and monitor NVIDIA GPU servers for AI inference and training workloads.
## When to Use This Skill
Use this skill when:
- Setting up a new GPU server for LLM inference or model training
- Installing or upgrading NVIDIA drivers and CUDA toolkit
- Configuring Docker with NVIDIA Container Toolkit for GPU workloads
- Partitioning A100/H100 GPUs with MIG for multi-tenant workloads
- Troubleshooting GPU errors, driver issues, or thermal throttling
## Prerequisites
- Ubuntu 22.04 LTS (recommended) or RHEL 8/9
- NVIDIA GPU (A10G, A100, H100, RTX 4090, or L40S recommended)
- Root or sudo access
- Internet access for package downloads
## Driver Installation (Ubuntu)

```bash
# Remove old drivers
sudo apt purge -y 'nvidia*' 'cuda*' 'libcuda*'
sudo apt autoremove -y
# Add the NVIDIA container toolkit repository (Docker GPU support)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
# Add the CUDA repository (provides driver and toolkit packages; ubuntu2204 keyring assumed)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# Install the driver and toolkit (560.x / CUDA 12.6 as of 2025)
sudo apt install -y nvidia-driver-560 cuda-toolkit-12-6
# Install NVIDIA Container Toolkit (Docker GPU support)
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify
nvidia-smi
nvcc --version
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
```
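To keep unattended upgrades from replacing the running driver mid-workload, one optional follow-up (package names as installed above) is to hold the driver and toolkit packages:

```bash
# Prevent apt from upgrading the driver/toolkit until you unhold them
sudo apt-mark hold nvidia-driver-560 cuda-toolkit-12-6
```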
## Post-Install Configuration

```bash
# Enable persistence mode (reduces driver initialization latency)
sudo nvidia-smi -pm 1
# Set power limit (reduce heat/noise on inference servers)
sudo nvidia-smi -pl 350 # watts; check TDP for your GPU model
# Disable ECC on inference servers (frees ~6% VRAM on GDDR GPUs; weakens memory-error protection)
sudo nvidia-smi --ecc-config=0 # requires reboot
# Verify NVLink topology before relying on P2P in multi-GPU training
nvidia-smi topo -m
```
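Persistence mode set via `nvidia-smi -pm 1` does not survive a reboot by itself. Assuming your driver package ships the `nvidia-persistenced` unit (recent Ubuntu packages do), enable the daemon:

```bash
# Keep the driver initialized across reboots
sudo systemctl enable --now nvidia-persistenced
```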
## GPU Health Monitoring

```bash
# Real-time monitoring (like htop for GPUs)
watch -n 1 nvidia-smi
# Detailed stats
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,\
utilization.memory,memory.used,memory.free,power.draw,clocks.current.graphics \
--format=csv --loop=1
# DCGM — production monitoring daemon (for clusters)
sudo apt install -y datacenter-gpu-manager
sudo systemctl start dcgm # newer packages name the service nvidia-dcgm
dcgmi discovery -l # list GPUs
dcgmi diag -r 1 # quick health check
dcgmi diag -r 3 # full diagnostic (takes ~20 min)
# Check GPU errors (XID errors — important for stability)
sudo dmesg | grep -i "NVRM\|nvidia\|XID"
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total \
--format=csv,noheader
```
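Where these checks feed cron or a pager, a minimal health-check sketch can wrap the same query. The script and its threshold below are illustrative assumptions, not part of the skill:

```bash
#!/bin/bash
# gpu-health.sh -- exit nonzero if any GPU runs too hot (threshold is an assumption)
MAX_TEMP=85
nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits |
while IFS=', ' read -r idx temp; do
  if [ "$temp" -ge "$MAX_TEMP" ]; then
    echo "GPU $idx at ${temp}C exceeds ${MAX_TEMP}C" >&2
    exit 1   # pipeline status propagates as the script's exit code
  fi
done
```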
## Prometheus GPU Metrics (DCGM Exporter)

```bash
# Deploy DCGM Exporter for Prometheus scraping
docker run -d \
--name dcgm-exporter \
--gpus all \
--cap-add SYS_ADMIN \
-p 9400:9400 \
--restart unless-stopped \
nvcr.io/nvidia/k8s/dcgm-exporter:latest
# Key metrics exposed:
# DCGM_FI_DEV_GPU_UTIL - GPU utilization %
# DCGM_FI_DEV_MEM_COPY_UTIL - Memory bandwidth utilization
# DCGM_FI_DEV_FB_USED - Framebuffer memory used (MB)
# DCGM_FI_DEV_SM_CLOCK - SM clock speed (MHz)
# DCGM_FI_DEV_GPU_TEMP - Temperature (°C)
# DCGM_FI_DEV_POWER_USAGE - Power draw (W)
# DCGM_FI_DEV_XID_ERRORS - XID error count (0 = healthy)
```
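To have Prometheus scrape the exporter, a minimal sketch of the scrape job follows; the config path and job name are assumptions for a stock Prometheus install:

```bash
# Append a scrape job for the DCGM exporter (indentation must match your scrape_configs)
cat <<'EOF' | sudo tee -a /etc/prometheus/prometheus.yml
  - job_name: dcgm
    static_configs:
      - targets: ['localhost:9400']
EOF
```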
## MIG Partitioning (A100/H100)
MIG (Multi-Instance GPU) allows slicing one GPU into isolated smaller GPUs.

```bash
# Enable MIG mode (requires reboot or restart of all processes)
sudo nvidia-smi -mig 1
sudo systemctl restart nvidia-persistenced
# List available MIG profiles (A100 80GB example)
nvidia-smi mig -lgip
# 1g.10gb — 1 slice, 10GB (max 7 instances)
# 2g.20gb — 2 slices, 20GB (max 3 instances)
# 3g.40gb — 3 slices, 40GB (max 2 instances)
# 7g.80gb — full GPU, 80GB (max 1 instance)
# Create MIG instances (e.g., 3× 2g.20gb for three isolated tenants)
sudo nvidia-smi mig -cgi 2g.20gb,2g.20gb,2g.20gb -C
# List created instances
nvidia-smi mig -lgi # GPU instances
nvidia-smi mig -lci # compute instances
# Use in Docker (UUID lookup sketch after this block)
docker run --gpus '"device=MIG-GPU-xxx/0/0"' ...
# Disable MIG
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi
sudo nvidia-smi -mig 0
```
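To find the exact MIG device names to pass to `--gpus`, list them with `nvidia-smi -L`; the UUIDs below are placeholders:

```bash
# Recent drivers print MIG devices as MIG-<uuid>
nvidia-smi -L
# GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-xxxx)
#   MIG 2g.20gb Device 0: (UUID: MIG-xxxx)
docker run --rm --gpus '"device=MIG-xxxx"' nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
```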
## Kernel & OS Tuning for GPU Servers

```bash
# Increase file descriptor limits
echo '* soft nofile 1048576' | sudo tee -a /etc/security/limits.conf
echo '* hard nofile 1048576' | sudo tee -a /etc/security/limits.conf
# Disable transparent huge pages (reduces latency jitter)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# Persist via rc.local (or a systemd unit; sketch after this block):
cat <<'EOF' | sudo tee /etc/rc.local
#!/bin/bash
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
nvidia-smi -pm 1
exit 0
EOF
sudo chmod +x /etc/rc.local
# Disable auto-boost for consistent clock behavior (older GPUs; newer architectures ignore these flags)
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi --auto-boost-permission=0
```
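As the systemd alternative mentioned above, a minimal oneshot unit sketch; the unit name `gpu-tuning.service` is an assumption:

```bash
# Hypothetical oneshot unit replacing rc.local
cat <<'EOF' | sudo tee /etc/systemd/system/gpu-tuning.service
[Unit]
Description=GPU server tuning (THP off, persistence mode)

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
ExecStart=/usr/bin/nvidia-smi -pm 1
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now gpu-tuning.service
```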
## Multi-GPU Topology Check

```bash
# Check NVLink and PCIe topology
nvidia-smi topo -m
# Output legend (interconnect type between each GPU pair):
# NV# = bonded set of # NVLinks (fastest; the number counts links, not the NVLink generation)
# PIX = same PCIe switch (fast)
# PHB = PCIe host bridge (slower; avoid for tensor-parallel training)
# SYS = crosses the NUMA/SMP interconnect (slowest)
# Bandwidth test between GPUs (from the CUDA samples; shipped separately since CUDA 11.6)
/usr/local/cuda/samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest
```
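If that binary is missing (the samples moved out of the toolkit after CUDA 11.6), a hedged build sketch; the directory layout inside the repo varies by release:

```bash
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make && ./p2pBandwidthLatencyTest
```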
## Common Issues
| Issue | Cause | Fix |
|---|---|---|
| `nvidia-smi: command not found` | Driver not installed | Follow the driver installation steps above |
| Driver version mismatch | CUDA/driver incompatibility | Check the compatibility matrix at developer.nvidia.com |
| GPU temperature >85°C | Poor airflow or fan failure | Check `nvidia-smi -q -d TEMPERATURE`; reseat cooler |
| XID 79 errors | GPU hardware error | Run `dcgmi diag -r 3`; may need GPU replacement |
| `failed to open device` in container | Container toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker`, then restart Docker |
| Low PCIe bandwidth | Wrong slot or power limit | Check the PCIe link width/speed in `nvidia-smi -q` output |
## Best Practices
- Always enable persistence mode (`nvidia-smi -pm 1`); it reduces first-request latency.
- Monitor XID errors; persistent XID 79/94 indicates hardware failure.
- For training, prefer NVLink-connected GPUs; for inference, PCIe is usually fine.
- Set up DCGM alerts on temperature >80°C and power draw near TDP (example rule below).
- Use MIG for multi-tenant inference to provide GPU isolation between models.
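As one way to wire up the temperature alert above, a minimal Prometheus rule sketch; the file path, alert name, and `gpu` label are assumptions, while the metric name comes from the DCGM exporter section:

```bash
# Hypothetical rule file; reference it from rule_files in prometheus.yml
cat <<'EOF' | sudo tee /etc/prometheus/gpu-alerts.yml
groups:
  - name: gpu
    rules:
      - alert: GpuHot
        expr: DCGM_FI_DEV_GPU_TEMP > 80
        for: 5m
        annotations:
          summary: "GPU {{ $labels.gpu }} above 80°C"
EOF
```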
## Related Skills
- vllm-server - LLM inference on GPUs
- llm-fine-tuning - GPU training setup
- linux-hardening - Secure the host OS
- prometheus-grafana - Metrics dashboards