cuda-guide

SKILL.md

CUDA Guide

Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC

Core Principles

  1. Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance
  2. Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning
  3. Coalesced Access: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by 32x
  4. Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together
  5. Minimize Host-Device Transfers: PCIe bandwidth is the bottleneck; overlap transfers with computation using streams and pinned memory

Guardrails

Error Checking

  • ALWAYS check CUDA API return values with a macro wrapper
  • ALWAYS call cudaGetLastError() after every kernel launch
  • ALWAYS call cudaDeviceSynchronize() before reading kernel results on the host
  • Use compute-sanitizer (successor to cuda-memcheck) in development builds
  • Handle cudaErrorMemoryAllocation gracefully; never assume GPU memory is infinite
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = call;                                            \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                   \
                    __FILE__, __LINE__, cudaGetErrorString(err));           \
            exit(EXIT_FAILURE);                                            \
        }                                                                   \
    } while (0)

#define CUDA_CHECK_KERNEL()                                                 \
    do {                                                                    \
        cudaError_t err = cudaGetLastError();                              \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "Kernel launch error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err));           \
            exit(EXIT_FAILURE);                                            \
        }                                                                   \
    } while (0)

Memory Management

  • Pair every cudaMalloc with a cudaFree; prefer RAII wrappers in C++ host code
  • Use cudaMallocManaged (Unified Memory) for prototyping; switch to explicit transfers for production
  • Use cudaMallocHost (pinned memory) when streaming data to the GPU; pageable memory cannot overlap with compute
  • Prefer cudaMemcpyAsync with streams over synchronous cudaMemcpy
  • Never access device pointers from host code or host pointers from device code (except Unified Memory)
  • Call cudaMemset or cudaMemsetAsync to zero-initialize device buffers

Kernel Design

  • Block size must be a multiple of warp size (32); prefer 128, 256, or 512
  • Calculate grid size as (n + block_size - 1) / block_size
  • Always include bounds checking: if (idx < n) at the top of every kernel
  • Use grid-stride loops for kernels that must handle arbitrary data sizes
  • Document thread mapping: which dimension maps to which data axis
  • Mark device-only helpers as __device__, host+device as __host__ __device__
// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

Synchronization

  • Use __syncthreads() after every shared memory write before any thread reads another thread's value
  • Never place __syncthreads() inside a conditional branch that not all threads in a block will reach (deadlock)
  • Use __syncwarp() (CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution
  • Use cudaDeviceSynchronize() sparingly in production; prefer stream synchronization with cudaStreamSynchronize()
  • Use CUDA events (cudaEventRecord / cudaEventSynchronize) for fine-grained inter-stream ordering

Performance

  • Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for system-level view
  • Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions
  • Aim for >60% of peak memory bandwidth in memory-bound kernels
  • Avoid warp divergence: ensure threads within a warp take the same branch when possible
  • Prefer float over double on consumer GPUs (2x throughput difference)
  • Minimize atomic operations on global memory; use shared memory atomics with a final reduction

Memory Hierarchy

Understanding the memory hierarchy is the single most important factor in CUDA performance.

Memory Type Scope Latency (cycles) Size Cached Read/Write
Registers Thread 1 ~255 per thread N/A R/W
Shared Block ~5 48-164 KB per SM N/A R/W
L1 Cache SM ~28 48-192 KB per SM Auto R
L2 Cache Device ~200 4-40 MB Auto R/W
Global Device ~400-600 4-80 GB (HBM/GDDR) Yes R/W
Constant Device ~5 (cached) 64 KB Yes (broadcast) R
Texture Device ~400 (cached) Global pool Yes (spatial) R

Decision guide:

  • Data reused within a thread -> registers (automatic via local variables)
  • Data shared across threads in a block -> __shared__ memory
  • Read-only data broadcast to all threads -> __constant__ memory
  • Large read-only data with spatial locality -> texture memory
  • Everything else -> global memory with coalesced access patterns

Key Patterns

Kernel Launch Configuration

// Query device for optimal configuration
void launch_optimized(const float* input, float* output, int n) {
    int block_size;
    int min_grid_size;

    // Let the runtime suggest optimal block size for maximum occupancy
    cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size, my_kernel, 0, n);

    int grid_size = (n + block_size - 1) / block_size;
    my_kernel<<<grid_size, block_size>>>(input, output, n);
    CUDA_CHECK_KERNEL();
}

Coalesced Memory Access

// BAD: Strided access -- adjacent threads access non-adjacent memory
// Each warp issues 32 separate memory transactions
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H) {
        out[x * H + y] = in[y * W + x];  // Write is strided
    }
}

// GOOD: Use shared memory to coalesce both reads and writes
__global__ void transpose_coalesced(
    const float* in, float* out, int W, int H
) {
    __shared__ float tile[32][33]; // +1 padding avoids bank conflicts

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    if (x < W && y < H) {
        tile[threadIdx.y][threadIdx.x] = in[y * W + x]; // Coalesced read
    }
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;

    if (x < H && y < W) {
        out[y * H + x] = tile[threadIdx.x][threadIdx.y]; // Coalesced write
    }
}

Shared Memory Tiling

// Dot product of two vectors using shared memory reduction
__global__ void dot_product(
    const float* a, const float* b, float* result, int n
) {
    __shared__ float cache[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread computes its partial sum via grid-stride
    float partial = 0.0f;
    for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
        partial += a[i] * b[i];
    }
    cache[tid] = partial;
    __syncthreads();

    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            cache[tid] += cache[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        atomicAdd(result, cache[0]);
    }
}

Warp-Level Primitives (CUDA 9+)

// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
    __shared__ float warp_sums[32]; // One slot per warp (max 32 warps/block)

    int lane = threadIdx.x % warpSize;
    int warp_id = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);

    if (lane == 0) {
        warp_sums[warp_id] = val;
    }
    __syncthreads();

    // First warp reduces the warp sums
    int num_warps = (blockDim.x + warpSize - 1) / warpSize;
    val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
    if (warp_id == 0) {
        val = warp_reduce_sum(val);
    }

    return val;
}

Performance

Occupancy Calculator

// Query occupancy at compile time for tuning
void report_occupancy() {
    int block_size = 256;
    int num_blocks;

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &num_blocks, my_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int active_warps = num_blocks * (block_size / prop.warpSize);
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    float occupancy = (float)active_warps / max_warps;

    printf("Occupancy: %.1f%% (%d/%d warps)\n",
           occupancy * 100, active_warps, max_warps);
}

Nsight Profiling Workflow

# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep

# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep    # Open in Nsight Compute GUI

# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program

Memory Bandwidth Measurement

// Measure effective bandwidth of a kernel
void measure_bandwidth(int n) {
    size_t bytes = 2 * n * sizeof(float); // Read A + Write B

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    copy_kernel<<<grid, block>>>(d_in, d_out, n);
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    float gb_per_sec = bytes / (ms * 1e6);
    printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
}

Tooling

Essential Commands

# Compile CUDA code
nvcc -arch=sm_80 -O3 -o program main.cu         # Single file
nvcc -arch=native -lineinfo -o program main.cu   # With debug line info

# CMake build
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)

# Runtime debugging
compute-sanitizer ./program                      # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program     # Shared memory race conditions
compute-sanitizer --tool initcheck ./program     # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program     # Synchronization errors

# Profiling
nsys profile ./program                           # System-level timeline
ncu ./program                                    # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program

# Device info
nvidia-smi                                       # GPU status and memory usage
nvcc --version                                   # CUDA compiler version

CMakeLists.txt Template

cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)

find_package(CUDAToolkit REQUIRED)

add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)

add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)

enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)

References

For detailed patterns and examples, see:

External References

Weekly Installs
6
Repository
ar4mirez/samuel
GitHub Stars
3
First Seen
Mar 1, 2026
Installed on
cline6
gemini-cli6
github-copilot6
codex6
kimi-cli6
cursor6