gpu-container-setup
GPU Container Setup Skill
This skill automates multi-vendor GPU container setup for PyTorch workloads.
Supported GPU Vendors
| Vendor | PyTorch Backend | Detection |
|---|---|---|
| NVIDIA | CUDA | nvidia-smi |
| AMD | ROCm (HIP) | rocm-smi, /opt/rocm |
| Ascend | torch_npu | npu-smi, /usr/local/Ascend |
| Metax | torch_musa | mx-smi, /opt/metax |
| Iluvatar | torch_corex | ixsmi, /opt/iluvatar |
Execution Flow
When invoked, follow these steps:
Step 1: Parse Arguments
Check if user provided:
- --vendor <name> - Force specific vendor (skip detection)
- --image <image> - Force specific container image
- --data <path> - Force specific data mount path
- --name <name> - Container name (default: pytorch-gpu)
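A minimal sketch of how these four flags could be parsed, assuming a Python entry point built on argparse. The function name build_parser and the lowercase vendor choices are illustrative assumptions, not part of the skill's shipped scripts.

```python
import argparse

# Lowercase vendor identifiers are an assumption, modeled on the
# "vendor": "ascend" value shown in the detection output.
VENDORS = ["nvidia", "amd", "ascend", "metax", "iluvatar"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the four documented flags."""
    parser = argparse.ArgumentParser(prog="gpu-container-setup")
    parser.add_argument("--vendor", choices=VENDORS, default=None,
                        help="force a specific vendor (skip detection)")
    parser.add_argument("--image", default=None,
                        help="force a specific container image")
    parser.add_argument("--data", default=None,
                        help="force a specific data mount path")
    parser.add_argument("--name", default="pytorch-gpu",
                        help="container name")
    return parser


args = build_parser().parse_args(["--vendor", "ascend", "--data", "/mnt/data"])
print(args.vendor, args.image, args.data, args.name)
# prints: ascend None /mnt/data pytorch-gpu
```

A value of None for vendor, image, or data signals that the corresponding detection step still has to run.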
Step 2: Detect GPU Vendor
Run the detection script:
python3 .claude/skills/gpu-container-setup/scripts/detect_gpu.py
Expected output:
{"vendor": "ascend", "devices": ["Ascend 910B"], "count": 8}
If detection fails and no --vendor flag was provided, ask the user which vendor to use.
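The contents of detect_gpu.py are not shown here, so the following is only a sketch of how the detection signals from the vendor table (a CLI tool on PATH or an install directory) might be probed. The probe order and the function name detect_vendor are assumptions.

```python
import os
import shutil

# (vendor, CLI tools, install directories) copied from the vendor table.
# The probe order is an assumption; the real script may differ.
PROBES = [
    ("nvidia", ["nvidia-smi"], []),
    ("amd", ["rocm-smi"], ["/opt/rocm"]),
    ("ascend", ["npu-smi"], ["/usr/local/Ascend"]),
    ("metax", ["mx-smi"], ["/opt/metax"]),
    ("iluvatar", ["ixsmi"], ["/opt/iluvatar"]),
]


def detect_vendor(which=shutil.which, isdir=os.path.isdir):
    """Return the first vendor whose tool or directory is present, else None.

    `which` and `isdir` are injectable so the logic can be exercised
    without real hardware.
    """
    for vendor, tools, dirs in PROBES:
        if any(which(tool) for tool in tools) or any(isdir(d) for d in dirs):
            return vendor
    return None


print(detect_vendor())
```

A None result corresponds to the "detection fails" case, where the user is asked for the vendor.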
Step 3: Find Data Disk
Run the data disk detection:
python3 .claude/skills/gpu-container-setup/scripts/find_data_disk.py
Expected output:
{"data_disk": "/mnt/data", "found": true, "size": "2.0T", "available": "1.5T"}
If no suitable disk is found, ask the user for a data mount path.
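As an illustration of the output shape above, here is one way a data disk could be chosen: take the candidate mount point with the most free space above a threshold. The selection rule, the candidate list, and the names pick_data_disk and human_size are assumptions; find_data_disk.py may work differently.

```python
import shutil


def human_size(num_bytes: int) -> str:
    """Format a byte count with binary units, e.g. 2 * 1024**4 -> '2.0T'."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


def pick_data_disk(candidates, min_free_bytes, usage=shutil.disk_usage):
    """Pick the candidate path with the most free space (hypothetical rule)."""
    best = None
    for path in candidates:
        try:
            stats = usage(path)
        except OSError:
            continue  # path missing or unreadable: skip it
        if stats.free >= min_free_bytes and (best is None or stats.free > best[1].free):
            best = (path, stats)
    if best is None:
        return {"data_disk": None, "found": False}
    path, stats = best
    return {
        "data_disk": path,
        "found": True,
        "size": human_size(stats.total),
        "available": human_size(stats.free),
    }


# Candidate paths and the 100 GiB threshold are illustrative only.
print(pick_data_disk(["/mnt/data", "/data", "/home"], 100 * 1024**3))
```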
Step 4: Find Container Image
Follow strict priority order (only proceed to next if current fails):
1. Primary Vendor Hub (hardcoded) → 2. BAAI Harbor → 3. Web Search → 4. Local Images → 5. Ask User
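The strict priority order can be read as a simple fallback chain: try each source in turn and stop at the first image that also passes the usability test. The sketch below assumes each source is a callable returning an image reference or None. The name first_working_image and the stub sources are hypothetical.

```python
def first_working_image(sources, is_usable):
    """Walk (label, finder) pairs in priority order.

    Returns (label, image) for the first image that passes `is_usable`,
    or None when every source fails, meaning the user must be asked.
    """
    for label, finder in sources:
        try:
            image = finder()
        except Exception:
            image = None  # an unreachable registry counts as a failed source
        if image and is_usable(image):
            return label, image
    return None


def unreachable_hub():
    raise ConnectionError("registry unreachable")


# Stub sources for illustration only.
sources = [
    ("vendor-hub", unreachable_hub),
    ("baai-harbor", lambda: None),
    ("web-search", lambda: "example.invalid/pytorch:demo"),
    ("local-images", lambda: "pytorch:local"),
]
print(first_working_image(sources, is_usable=lambda image: True))
```

Keeping the usability check inside the loop means an image that is found but fails its test falls through to the next source, matching the "try next source" rule.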
Step 4.1: Primary Vendor Hub (hardcoded URLs)
| Vendor | Registry | API/Query |
|---|---|---|
| NVIDIA | nvcr.io | https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags |
| Ascend | ascendhub.huawei.com | Portal: https://ascendhub.huawei.com |
| Metax | registry.metax-tech.com | https://registry.metax-tech.com/v2/pytorch/metax-pytorch/tags/list |
| Iluvatar | hub.iluvatar.com | https://hub.iluvatar.com/v2/pytorch/iluvatar-pytorch/tags/list |
| AMD | docker.io (rocm/pytorch) | https://hub.docker.com/v2/repositories/rocm/pytorch/tags |
# Example: Query NGC for latest NVIDIA PyTorch
TAG=$(curl -s "https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags" | jq -r '.tags[].name' | grep -E '^[0-9]{2}\.[0-9]{2}-py3$' | sort -rV | head -1)
IMAGE="nvcr.io/nvidia/pytorch:${TAG}"
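The tag filter in the shell pipeline above (keep YY.MM-py3 tags, take the newest) can also be written in Python, which may be easier to reuse across vendors. This sketch works on an already fetched list of tag names. The function name latest_release_tag is an assumption, and the sample tags are made up for illustration.

```python
import re

# Same shape as the grep pattern above: two digits, dot, two digits, "-py3".
TAG_PATTERN = re.compile(r"(\d{2})\.(\d{2})-py3")


def latest_release_tag(tags):
    """Return the newest YY.MM-py3 tag, or None if no tag matches."""
    candidates = []
    for tag in tags:
        match = TAG_PATTERN.fullmatch(tag)
        if match:
            year, month = int(match.group(1)), int(match.group(2))
            candidates.append(((year, month), tag))
    # Tuples compare by (year, month) first, so max() picks the newest release.
    return max(candidates)[1] if candidates else None


sample = ["23.12-py3", "24.01-py3", "24.01-py3-igpu", "latest"]
print(latest_release_tag(sample))
# prints: 24.01-py3
```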
Step 4.2: BAAI Harbor (fallback)
Use only if Step 4.1 fails (registry unreachable, no matching image, or pull fails).
# Query BAAI Harbor
curl -s "https://harbor.baai.ac.cn/api/v2.0/projects/flagrelease-public/repositories?page_size=100" | jq -r '.[].name' | grep "flagrelease-<vendor>"
Step 4.3: Web Search (fallback)
Only if Steps 4.1 and 4.2 fail. Search for "<vendor> pytorch docker official".
Step 4.4: Local Images (fallback)
Only if Steps 4.1-4.3 fail. List local candidates with: docker images | grep pytorch
Test Before Use
docker pull "${IMAGE}" && docker run --rm "${IMAGE}" python -c "import torch; print(torch.__version__)"
If the test fails, try the next source. If all sources fail, ask the user for an image.
Step 4.5: Update Skill (self-improvement)
IMPORTANT: If an image found via Web Search (Step 4.3) passes all tests, update references/image-sources.md to add the newly discovered vendor hub as a primary source. This makes future lookups faster.
# After successful web search discovery:
# 1. Verify image works (pull + pytorch test + GPU test)
# 2. Extract registry URL pattern
# 3. Update references/image-sources.md Step 1 section with new vendor hub
Step 5: Build Docker Command
Refer to references/mount-requirements.md for vendor-specific requirements.
NVIDIA:
docker run -d --gpus all \
--name pytorch-gpu \
--shm-size=16g \
-v <data_disk>:/data \
<image> sleep infinity
AMD/ROCm:
docker run -d \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
--name pytorch-gpu \
--shm-size=16g \
-v <data_disk>:/data \
<image> sleep infinity
Ascend:
docker run -d \
--device=/dev/davinci0 --device=/dev/davinci1 ... \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend:/usr/local/Ascend:ro \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
--name pytorch-gpu \
--shm-size=16g \
-v <data_disk>:/data \
<image> sleep infinity
Metax:
docker run -d \
--device=/dev/mx0 --device=/dev/mx1 ... \
-v /opt/metax:/opt/metax:ro \
--name pytorch-gpu \
--shm-size=16g \
-v <data_disk>:/data \
<image> sleep infinity
Iluvatar:
docker run -d \
--device=/dev/bi0 --device=/dev/bi1 ... \
-v /opt/iluvatar:/opt/iluvatar:ro \
--name pytorch-gpu \
--shm-size=16g \
-v <data_disk>:/data \
<image> sleep infinity
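The five commands above share a common tail (name, shared memory size, data mount, image, sleep infinity) and differ only in vendor flags and per-card device nodes. A sketch of a builder that captures this, assuming the caller supplies the per-card device nodes that the "..." in the commands stands for. The names VENDOR_FLAGS and build_docker_run are hypothetical.

```python
import shlex

# Fixed, vendor-specific flags copied from the commands above.
# Per-card nodes (e.g. /dev/davinci0) are passed in separately.
VENDOR_FLAGS = {
    "nvidia": ["--gpus", "all"],
    "amd": ["--device=/dev/kfd", "--device=/dev/dri",
            "--group-add", "video", "--group-add", "render"],
    "ascend": ["--device=/dev/davinci_manager", "--device=/dev/devmm_svm",
               "--device=/dev/hisi_hdc",
               "-v", "/usr/local/Ascend:/usr/local/Ascend:ro",
               "-v", "/usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro"],
    "metax": ["-v", "/opt/metax:/opt/metax:ro"],
    "iluvatar": ["-v", "/opt/iluvatar:/opt/iluvatar:ro"],
}


def build_docker_run(vendor, image, data_disk, name="pytorch-gpu", device_nodes=()):
    """Assemble the docker run argument list for one vendor."""
    cmd = ["docker", "run", "-d"]
    cmd += [f"--device={node}" for node in device_nodes]
    cmd += VENDOR_FLAGS[vendor]
    cmd += ["--name", name, "--shm-size=16g",
            "-v", f"{data_disk}:/data", image, "sleep", "infinity"]
    return cmd


cmd = build_docker_run("ascend", "<image>", "/mnt/data",
                       device_nodes=["/dev/davinci0", "/dev/davinci1"])
print(shlex.join(cmd))
```

Returning a list rather than a string lets the command be passed to subprocess.run without shell quoting issues; shlex.join is used only for display.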
Step 6: Start Container
Execute the docker run command. If a container with the same name already exists:
- If it is running, offer to use the existing container or replace it
- If it is stopped, offer to restart it or replace it
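One way to implement the name-collision check is to query the container state with docker container inspect and map it to the choices offered to the user. This is a sketch under the assumption that the Docker CLI is on PATH. The function names and option labels are illustrative.

```python
import subprocess


def container_state(name: str) -> str:
    """Return 'running', 'stopped', or 'absent' for a container name.

    Assumes the docker CLI is installed; a non-zero exit from
    `docker container inspect` is treated as "no such container".
    """
    result = subprocess.run(
        ["docker", "container", "inspect", "-f", "{{.State.Running}}", name],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return "absent"
    return "running" if result.stdout.strip() == "true" else "stopped"


def options_for(state: str):
    """Map a container state to the choices described in Step 6."""
    return {
        "running": ["use existing", "replace"],
        "stopped": ["restart", "replace"],
        "absent": ["create"],
    }[state]


print(options_for("stopped"))
# prints: ['restart', 'replace']
```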
Step 7: Validate PyTorch GPU
Copy the validation script into the container and run it:
docker cp .claude/skills/gpu-container-setup/scripts/validate_pytorch.py pytorch-gpu:/tmp/
docker exec pytorch-gpu python3 /tmp/validate_pytorch.py
Expected output:
{
"status": "PASS",
"backend": "npu",
"device_count": 8,
"device_names": ["Ascend 910B", ...],
"tests": {
"device_detection": true,
"tensor_creation": true,
"matrix_multiply": true,
"gpu_to_cpu_transfer": true
}
}
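To turn this report into the validation status for Step 8, the JSON can be parsed and each named test checked. The sketch below assumes the script prints exactly one JSON object with the keys shown above. The function name summarize_validation is hypothetical.

```python
import json

# Test names taken from the expected output above.
REQUIRED_TESTS = ("device_detection", "tensor_creation",
                  "matrix_multiply", "gpu_to_cpu_transfer")


def summarize_validation(raw: str):
    """Return (ok, failed_tests) for a validation report given as JSON text."""
    report = json.loads(raw)
    tests = report.get("tests", {})
    failed = [name for name in REQUIRED_TESTS if not tests.get(name, False)]
    ok = report.get("status") == "PASS" and not failed
    return ok, failed


# Sample report for illustration; values are not real measurements.
sample = json.dumps({
    "status": "FAIL",
    "backend": "npu",
    "device_count": 8,
    "tests": {"device_detection": True, "tensor_creation": True,
              "matrix_multiply": False, "gpu_to_cpu_transfer": True},
})
print(summarize_validation(sample))
# prints: (False, ['matrix_multiply'])
```

Listing the failed test names gives the "detailed error" that the error handling table asks for when validation fails.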
Step 8: Report Results
Summarize to user:
- GPU vendor and devices detected
- Container name and image used
- Data mount path
- Validation status
- How to access:
docker exec -it pytorch-gpu bash
Error Handling
| Error | Action |
|---|---|
| No GPU detected | Ask user for vendor or check drivers |
| Image pull fails | Try alternative registry or web search |
| Container start fails | Check device permissions, show error |
| Validation fails | Show detailed error, suggest fixes |
Reference Files
- references/gpu-detection.md - Detection methods by vendor
- references/image-sources.md - Image discovery guide (registry APIs, priority order, selection criteria)
- references/mount-requirements.md - Vendor mount specifications
Example Usage
User: /gpu-container-setup
User: setup a pytorch container
User: start container with ascend GPU
User: /gpu-container-setup --image nvcr.io/nvidia/pytorch:24.01-py3
User: /gpu-container-setup --image harbor.baai.ac.cn/flagrelease-public/ngctorch:2601