ml-env

SKILL.md

ML Environment Setup & Troubleshooting

This skill helps you create and manage isolated ML environments with PyTorch. It auto-detects your hardware (NVIDIA GPU, AMD GPU, or CPU) and installs the appropriate PyTorch build.

Creating a New ML Project

I can help you set up a complete ML project with PyTorch in seconds. Here's what I'll do:

  1. Create your project directory
  2. Create a .gitignore for ML files
  3. Run hardware detection
  4. Install PyTorch with the right backend
  5. Install ML libraries
  6. Validate everything works

To get started, just tell me:

  • The project name/path where you want it created

I'll handle the rest!

Interactive Setup Process

When you ask me to set up a new ML project, I will:

# 1. Create the project directory
mkdir -p ~/projects/my-ml-project

# 2. Create .gitignore
# (ignores ml-env/, data/, models/, logs/, etc.)

# 3. Run the setup script
bash ~/.claude/skills/ml-env/scripts/setup-universal.sh

# 4. Show you the results

The setup script will:

  • Detect your GPU (NVIDIA with nvidia-smi, AMD with rocminfo, or fallback to CPU)
  • Ask questions for special hardware (Blackwell GPUs, Strix Halo, etc.)
  • Install Python 3.13 virtual environment with uv
  • Install PyTorch 2.10.0 with correct backend
  • Install ML libraries: numpy, pandas, scikit-learn, jupyter, accelerate, etc.
  • Create ml-env/ directory in your project
  • Optionally initialize git
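The detection step above can be sketched in Python (the actual script is shell; `detect_backend` is an illustrative helper, not part of the skill):

```python
import shutil

def detect_backend():
    """Pick a PyTorch backend by probing for vendor tools on PATH."""
    if shutil.which("nvidia-smi"):
        return "cuda"   # NVIDIA driver present
    if shutil.which("rocminfo"):
        return "rocm"   # ROCm stack present
    return "cpu"        # safe fallback

print(detect_backend())
```

The real script also inspects the GPU architecture (e.g. gfx1151) before choosing a wheel index, but the PATH probe is the first gate.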

After Setup: Using Your Environment

Once the environment is created, activating it is simple:

cd ~/projects/my-ml-project
source ml-env/bin/activate  # Regular environments
# OR
source ml-env/activate-safe.sh  # If you use conda (ignores conda settings)

Check it works:

python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')"
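For a friendlier check, a small guarded helper works too (illustrative; it degrades gracefully when PyTorch is absent or the GPU is not visible):

```python
def summarize_env():
    """One-line summary of the installed PyTorch backend (or its absence)."""
    try:
        import torch
    except ImportError:
        return "PyTorch not installed"
    if torch.cuda.is_available():  # True for both CUDA and ROCm builds
        return f"PyTorch {torch.__version__} on {torch.cuda.get_device_name(0)}"
    return f"PyTorch {torch.__version__} on CPU"

print(summarize_env())
```

Note that on ROCm builds `torch.cuda.is_available()` also returns True, so this covers AMD GPUs as well.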

Hardware-Specific Guidance

NVIDIA GPUs

  • Supported: RTX 3090, 4090, 5090, and most Ampere/Ada/Blackwell
  • Installation: CUDA 12.8 (stable) or CUDA 13.0 (for RTX 5090)
  • Driver requirement: 520+ for CUDA 12.8, 550+ for CUDA 13.0
  • WSL2 users: Use Windows NVIDIA driver only (do NOT install Linux driver)
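The driver floors above can be checked mechanically; a sketch (`driver_ok` is a hypothetical helper; the `nvidia-smi` query flag in the docstring is real syntax):

```python
def driver_ok(version: str, minimum: int) -> bool:
    """Return True if the driver major version meets the floor.

    `version` is the string printed by:
      nvidia-smi --query-gpu=driver_version --format=csv,noheader
    e.g. "550.54.14".
    """
    major = int(version.split(".")[0])
    return major >= minimum

# CUDA 12.8 needs driver 520+; CUDA 13.0 needs 550+
print(driver_ok("550.54.14", 550))  # True
```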

AMD RDNA (RX 6000/7000 series)

  • Installation: ROCm 6.2
  • Requirements: User in render and video groups
  • Setup: sudo usermod -aG render,video $USER && newgrp render

AMD Strix Halo (gfx1151)

⚠️ This requires special handling - official PyTorch wheels do NOT work!

  • GPU: Ryzen AI MAX+ 395 with gfx1151
  • Critical issue: Official PyTorch wheels fail with "HIP error: invalid device function"
  • Solution: Use AMD gfx1151-specific builds
  • ROCm 7 (Recommended): https://repo.amd.com/rocm/whl/gfx1151/ - ~31 TFLOPS BF16
  • ROCm 6.4.4+ (Fallback): https://rocm.nightlies.amd.com/v2/gfx1151/ - ~12 TFLOPS BF16
  • Memory limits: Default ~33GB; configure GTT for larger models (30B+)
  • Setup requires: User in render and video groups, Linux kernel 6.14+ (6.16.9+ recommended for automatic UMA/GTT behavior)

Reference project: See ~/Projects/amdtest for a working gfx1151 setup example.

See TROUBLESHOOTING.md for complete Strix Halo setup and GTT memory configuration.

CPU-Only Systems

  • Works everywhere
  • Good for development/testing
  • Use for learning before scaling to GPU

Current Versions (2026)

  • PyTorch: 2.10.0
  • Python: 3.13 (or 3.12 if needed)
  • CUDA: 12.8 (main), 13.0 (Blackwell experimental)
  • ROCm: 6.2 (RDNA), 7.x preferred for Strix Halo (6.4.4+ as fallback)
  • Key ML libs: numpy, pandas, matplotlib, scikit-learn, jupyter, accelerate, tensorboard

Validating an Existing Environment

If you already have a project and want to verify it works:

cd ~/your-ml-project
bash ~/.claude/skills/ml-env/scripts/validate.sh

This will check:

  • Python and PyTorch versions
  • GPU/CPU backend detection
  • GPU memory and specifications
  • Computation tests

Troubleshooting

GPU Not Detected

NVIDIA:

nvidia-smi  # Check driver is installed
python -c "import torch; print(torch.cuda.is_available())"

AMD:

rocm-smi  # Check ROCm installation
rocminfo | grep gfx  # Check GPU architecture

CUDA Out of Memory

  1. Reduce batch size
  2. Enable mixed precision training
  3. Use torch.cuda.empty_cache() between batches
  4. Try gradient accumulation
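Step 4 (gradient accumulation) trades memory for steps by averaging several micro-batch gradients before one optimizer update. A framework-free toy sketch of the bookkeeping (plain floats stand in for gradient tensors):

```python
def train_with_accumulation(grads, accum_steps):
    """Apply one 'optimizer step' per accum_steps micro-batches.

    `grads` is a list of per-micro-batch gradients (toy scalars);
    returns the list of averaged gradients actually applied.
    """
    applied, running = [], 0.0
    for i, g in enumerate(grads, start=1):
        running += g / accum_steps   # scale like loss / accum_steps
        if i % accum_steps == 0:
            applied.append(running)  # optimizer.step() would go here
            running = 0.0            # followed by optimizer.zero_grad()
    return applied

print(train_with_accumulation([1.0, 3.0, 2.0, 6.0], 2))  # [2.0, 4.0]
```

In PyTorch the same pattern is `loss = loss / accum_steps; loss.backward()` every batch, with `optimizer.step()` and `optimizer.zero_grad()` only every `accum_steps` batches.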

PyTorch Not Finding GPU After Install

  1. Activate the environment: source ml-env/bin/activate
  2. Check driver version
  3. Reinstall PyTorch with correct index URL:
    uv pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128
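The index URL varies by backend; a small lookup table (the cu128, rocm6.2, and cpu URLs are real PyTorch wheel indexes; cu130 is assumed per the Blackwell note above, and the helper name is illustrative):

```python
# Map detected backend to the matching PyTorch wheel index.
INDEX_URLS = {
    "cu128": "https://download.pytorch.org/whl/cu128",
    "cu130": "https://download.pytorch.org/whl/cu130",
    "rocm":  "https://download.pytorch.org/whl/rocm6.2",
    "cpu":   "https://download.pytorch.org/whl/cpu",
}

def index_url(backend: str) -> str:
    return INDEX_URLS[backend]

print(index_url("cpu"))
```

Installing from the wrong index (e.g. the default PyPI CPU wheel on a GPU box) is the most common cause of "GPU not found after install".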
    

Strix Halo Specific Issues

See TROUBLESHOOTING.md for detailed Strix Halo troubleshooting, GTT memory setup, and performance optimization.

Best Practices

  1. Always activate first: Before running any Python/ML code
  2. Use virtual environments: Never install to system Python
  3. Move models to device: Explicitly move the model and its input tensors to the GPU
  4. Monitor memory: Keep an eye on GPU memory usage
  5. Test on CPU first: Develop with small data on CPU, scale to GPU
  6. Save checkpoints: Don't train for hours without saving progress

Common Workflows

Training a Model

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = YourModel().to(device)
optimizer = torch.optim.AdamW(model.parameters())

for batch in dataloader:
    x, y = batch
    x, y = x.to(device), y.to(device)  # data must live on the same device as the model
    optimizer.zero_grad()
    loss = model(x, y)
    loss.backward()
    optimizer.step()

Mixed Precision Training (Faster, Less Memory)

import torch

# torch.cuda.amp is deprecated; use the torch.amp equivalents
scaler = torch.amp.GradScaler("cuda")

for batch in dataloader:
    x, y = batch
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    with torch.amp.autocast("cuda"):
        loss = model(x, y)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()

Checking GPU Memory

import torch
if torch.cuda.is_available():
    print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")
    print(f"Reserved: {torch.cuda.memory_reserved()/1e9:.2f}GB")
    print(f"Total: {torch.cuda.get_device_properties(0).total_memory/1e9:.2f}GB")

Reference Documentation

Scripts in This Skill

All scripts are in ~/.claude/skills/ml-env/scripts/:

  • setup-universal.sh - Hardware detection and PyTorch installation (used during initial setup)
  • validate.sh - Validate an existing environment and test GPU/CPU

When to Use This Skill

Use me when you:

  • Want to create a new ML project
  • Need to set up PyTorch with GPU support
  • Are troubleshooting GPU/CUDA/ROCm issues
  • Want to update or maintain your ML environment
  • Have hardware-specific questions (NVIDIA, AMD, Strix Halo)
  • Need guidance on ML best practices

Questions?

Ask me anything about:

  • Creating new ML projects
  • Hardware setup and troubleshooting
  • PyTorch installation
  • GPU/CPU configuration
  • ML best practices
  • Updating packages