cuda-guide
SKILL.md
CUDA Guide
Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC
Core Principles
- Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance
- Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning
- Coalesced Access: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by 32x
- Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together
- Minimize Host-Device Transfers: PCIe bandwidth is the bottleneck; overlap transfers with computation using streams and pinned memory
Guardrails
Error Checking
- ALWAYS check CUDA API return values with a macro wrapper
- ALWAYS call
cudaGetLastError()after every kernel launch - ALWAYS call
cudaDeviceSynchronize()before reading kernel results on the host - Use
compute-sanitizer(successor tocuda-memcheck) in development builds - Handle
cudaErrorMemoryAllocationgracefully; never assume GPU memory is infinite
#define CUDA_CHECK(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while (0)
#define CUDA_CHECK_KERNEL() \
do { \
cudaError_t err = cudaGetLastError(); \
if (err != cudaSuccess) { \
fprintf(stderr, "Kernel launch error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while (0)
Memory Management
- Pair every
cudaMallocwith acudaFree; prefer RAII wrappers in C++ host code - Use
cudaMallocManaged(Unified Memory) for prototyping; switch to explicit transfers for production - Use
cudaMallocHost(pinned memory) when streaming data to the GPU; pageable memory cannot overlap with compute - Prefer
cudaMemcpyAsyncwith streams over synchronouscudaMemcpy - Never access device pointers from host code or host pointers from device code (except Unified Memory)
- Call
cudaMemsetorcudaMemsetAsyncto zero-initialize device buffers
Kernel Design
- Block size must be a multiple of warp size (32); prefer 128, 256, or 512
- Calculate grid size as
(n + block_size - 1) / block_size - Always include bounds checking:
if (idx < n)at the top of every kernel - Use grid-stride loops for kernels that must handle arbitrary data sizes
- Document thread mapping: which dimension maps to which data axis
- Mark device-only helpers as
__device__, host+device as__host__ __device__
// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i < n;
i += blockDim.x * gridDim.x) {
y[i] = a * x[i] + y[i];
}
}
Synchronization
- Use
__syncthreads()after every shared memory write before any thread reads another thread's value - Never place
__syncthreads()inside a conditional branch that not all threads in a block will reach (deadlock) - Use
__syncwarp()(CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution - Use
cudaDeviceSynchronize()sparingly in production; prefer stream synchronization withcudaStreamSynchronize() - Use CUDA events (
cudaEventRecord/cudaEventSynchronize) for fine-grained inter-stream ordering
Performance
- Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for system-level view
- Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions
- Aim for >60% of peak memory bandwidth in memory-bound kernels
- Avoid warp divergence: ensure threads within a warp take the same branch when possible
- Prefer
floatoverdoubleon consumer GPUs (2x throughput difference) - Minimize atomic operations on global memory; use shared memory atomics with a final reduction
Memory Hierarchy
Understanding the memory hierarchy is the single most important factor in CUDA performance.
| Memory Type | Scope | Latency (cycles) | Size | Cached | Read/Write |
|---|---|---|---|---|---|
| Registers | Thread | 1 | ~255 per thread | N/A | R/W |
| Shared | Block | ~5 | 48-164 KB per SM | N/A | R/W |
| L1 Cache | SM | ~28 | 48-192 KB per SM | Auto | R |
| L2 Cache | Device | ~200 | 4-40 MB | Auto | R/W |
| Global | Device | ~400-600 | 4-80 GB (HBM/GDDR) | Yes | R/W |
| Constant | Device | ~5 (cached) | 64 KB | Yes (broadcast) | R |
| Texture | Device | ~400 (cached) | Global pool | Yes (spatial) | R |
Decision guide:
- Data reused within a thread -> registers (automatic via local variables)
- Data shared across threads in a block ->
__shared__memory - Read-only data broadcast to all threads ->
__constant__memory - Large read-only data with spatial locality -> texture memory
- Everything else -> global memory with coalesced access patterns
Key Patterns
Kernel Launch Configuration
// Query device for optimal configuration
void launch_optimized(const float* input, float* output, int n) {
int block_size;
int min_grid_size;
// Let the runtime suggest optimal block size for maximum occupancy
cudaOccupancyMaxPotentialBlockSize(
&min_grid_size, &block_size, my_kernel, 0, n);
int grid_size = (n + block_size - 1) / block_size;
my_kernel<<<grid_size, block_size>>>(input, output, n);
CUDA_CHECK_KERNEL();
}
Coalesced Memory Access
// BAD: Strided access -- adjacent threads access non-adjacent memory
// Each warp issues 32 separate memory transactions
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < W && y < H) {
out[x * H + y] = in[y * W + x]; // Write is strided
}
}
// GOOD: Use shared memory to coalesce both reads and writes
__global__ void transpose_coalesced(
const float* in, float* out, int W, int H
) {
__shared__ float tile[32][33]; // +1 padding avoids bank conflicts
int x = blockIdx.x * 32 + threadIdx.x;
int y = blockIdx.y * 32 + threadIdx.y;
if (x < W && y < H) {
tile[threadIdx.y][threadIdx.x] = in[y * W + x]; // Coalesced read
}
__syncthreads();
x = blockIdx.y * 32 + threadIdx.x;
y = blockIdx.x * 32 + threadIdx.y;
if (x < H && y < W) {
out[y * H + x] = tile[threadIdx.x][threadIdx.y]; // Coalesced write
}
}
Shared Memory Tiling
// Dot product of two vectors using shared memory reduction
__global__ void dot_product(
const float* a, const float* b, float* result, int n
) {
__shared__ float cache[256];
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Each thread computes its partial sum via grid-stride
float partial = 0.0f;
for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
partial += a[i] * b[i];
}
cache[tid] = partial;
__syncthreads();
// Tree reduction in shared memory
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
cache[tid] += cache[tid + s];
}
__syncthreads();
}
if (tid == 0) {
atomicAdd(result, cache[0]);
}
}
Warp-Level Primitives (CUDA 9+)
// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
for (int offset = warpSize / 2; offset > 0; offset /= 2) {
val += __shfl_down_sync(0xFFFFFFFF, val, offset);
}
return val;
}
// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
__shared__ float warp_sums[32]; // One slot per warp (max 32 warps/block)
int lane = threadIdx.x % warpSize;
int warp_id = threadIdx.x / warpSize;
val = warp_reduce_sum(val);
if (lane == 0) {
warp_sums[warp_id] = val;
}
__syncthreads();
// First warp reduces the warp sums
int num_warps = (blockDim.x + warpSize - 1) / warpSize;
val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
if (warp_id == 0) {
val = warp_reduce_sum(val);
}
return val;
}
Performance
Occupancy Calculator
// Query occupancy at compile time for tuning
void report_occupancy() {
int block_size = 256;
int num_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&num_blocks, my_kernel, block_size, 0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int active_warps = num_blocks * (block_size / prop.warpSize);
int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
float occupancy = (float)active_warps / max_warps;
printf("Occupancy: %.1f%% (%d/%d warps)\n",
occupancy * 100, active_warps, max_warps);
}
Nsight Profiling Workflow
# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep
# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep # Open in Nsight Compute GUI
# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program
Memory Bandwidth Measurement
// Measure effective bandwidth of a kernel
void measure_bandwidth(int n) {
size_t bytes = 2 * n * sizeof(float); // Read A + Write B
cudaEvent_t start, stop;
CUDA_CHECK(cudaEventCreate(&start));
CUDA_CHECK(cudaEventCreate(&stop));
CUDA_CHECK(cudaEventRecord(start));
copy_kernel<<<grid, block>>>(d_in, d_out, n);
CUDA_CHECK(cudaEventRecord(stop));
CUDA_CHECK(cudaEventSynchronize(stop));
float ms = 0;
CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
float gb_per_sec = bytes / (ms * 1e6);
printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);
CUDA_CHECK(cudaEventDestroy(start));
CUDA_CHECK(cudaEventDestroy(stop));
}
Tooling
Essential Commands
# Compile CUDA code
nvcc -arch=sm_80 -O3 -o program main.cu # Single file
nvcc -arch=native -lineinfo -o program main.cu # With debug line info
# CMake build
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)
# Runtime debugging
compute-sanitizer ./program # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program # Shared memory race conditions
compute-sanitizer --tool initcheck ./program # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program # Synchronization errors
# Profiling
nsys profile ./program # System-level timeline
ncu ./program # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program
# Device info
nvidia-smi # GPU status and memory usage
nvcc --version # CUDA compiler version
CMakeLists.txt Template
cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)
find_package(CUDAToolkit REQUIRED)
add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)
add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)
enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)
References
For detailed patterns and examples, see:
- references/patterns.md -- Tiled matrix multiply, parallel reduction tree, stream overlap pipeline
External References
Weekly Installs
6
Repository
ar4mirez/samuelGitHub Stars
3
First Seen
Mar 1, 2026
Security Audits
Installed on
cline6
gemini-cli6
github-copilot6
codex6
kimi-cli6
cursor6