cutlass-skill
CUTLASS & CuTeDSL Development
Source Code Locations
The CUTLASS source lives under repos/cutlass/ in this skill's install directory.
The actual path depends on the tool in use:
- Cursor: ~/.cursor/skills/cutlass-skill/repos/cutlass/
- Claude Code: ~/.claude/skills/cutlass-skill/repos/cutlass/
- Codex: ~/.agents/skills/cutlass-skill/repos/cutlass/
CUTLASS_REPO: the examples below use ~/.cursor/skills/cutlass-skill/repos/cutlass/ as a placeholder; replace it with your actual path.
If the path does not exist, run bash update-repos.sh cutlass from the project directory.
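Since the install location differs per tool, it can help to resolve it programmatically. The following is a minimal sketch, assuming only the three skill roots listed above; the function name `find_cutlass_repo` and the `home` parameter are hypothetical and not part of the skill itself.

```python
from pathlib import Path
from typing import Optional

# Candidate skill roots for Cursor, Claude Code, and Codex, relative to the home directory.
SKILL_ROOTS = (".cursor/skills", ".claude/skills", ".agents/skills")


def find_cutlass_repo(home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing repos/cutlass directory, or None if none exists."""
    home = Path.home() if home is None else Path(home)
    for root in SKILL_ROOTS:
        candidate = home / root / "cutlass-skill" / "repos" / "cutlass"
        if candidate.is_dir():
            return candidate
    return None
```

Calling `find_cutlass_repo()` with no arguments checks the current user's home directory; a `None` result suggests running `bash update-repos.sh cutlass` first.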
CuTeDSL (Python DSL for GPU Kernels)
CUTLASS_REPO/python/CuTeDSL/
├── cutlass/
│ ├── base_dsl/ # DSL basics: types, variables, functions, PTX emit
│ ├── cute/ # CuTe Python bindings: Layout, Tensor, TiledMMA, TiledCopy
│ ├── cutlass_dsl/ # CUTLASS DSL: GEMM builder, epilogue, pipeline
│ ├── pipeline/ # Pipeline abstractions: MainloopPipeline, PipelineAsync
│ ├── jax/ # JAX integration
│ ├── utils/ # Compilation utilities, profiler, tensor utilities
│ └── torch.py # PyTorch integration
CuTeDSL examples:
CUTLASS_REPO/examples/python/CuTeDSL/
├── ampere/ # Ampere: sgemm, tensorop_gemm, flash_attention_v2
├── hopper/ # Hopper: TMA gemm, FP8, grouped GEMM
├── blackwell/ # Blackwell: blockwise_gemm
├── blackwell_geforce/ # GeForce Blackwell
├── cute/ # CuTe tutorials (Python)
├── distributed/ # Distributed GEMM
├── experimental/ # Experimental features
├── jax/ # JAX examples
├── notebooks/ # Jupyter notebooks (English)
├── notebooks-zh/ # Jupyter notebooks (Chinese)
└── advanced_compiler_control/ # Advanced compiler control
CUTLASS C++ Examples (by architecture)
CUTLASS_REPO/examples/
├── 00-47: Ampere and earlier architectures
├── 48-69: Hopper (sm_90)
│ ├── 48_hopper_warp_specialized_gemm
│ ├── 49_hopper_gemm_with_collective_builder
│ ├── 54_hopper_fp8_warp_specialized_gemm
│ ├── 55_hopper_mixed_dtype_gemm
│ ├── 57_hopper_grouped_gemm
│ ├── 62_hopper_sparse_gemm
│ ├── 67_hopper_fp8..._blockwise_scaling
│ ├── 88_hopper_fmha # Hopper example numbered outside the 48-69 range
│ └── ...
├── 70-93: Blackwell (sm_100)
│ ├── 70_blackwell_gemm
│ ├── 71_blackwell_gemm_with_collective_builder
│ ├── 72_blackwell_narrow_precision_gemm
│ ├── 77_blackwell_fmha
│ ├── 81_blackwell_gemm_blockwise
│ ├── 83_blackwell_sparse_gemm
│ ├── 92_blackwell_moe_gemm
│ ├── 93_blackwell_low_latency_gqa
│ └── ...
└── cute/tutorial/ # CuTe C++ tutorials (sgemm, tiled_copy, hopper, blackwell)
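The numeric ranges above are a rough guide, and 88_hopper_fmha shows that the directory name can disagree with them. A small sketch of how one might classify an example directory, preferring the architecture token in the name and falling back to the listed ranges; `example_arch` is a hypothetical helper, not something shipped with CUTLASS.

```python
import re


def example_arch(dirname: str) -> str:
    """Guess the target architecture of an examples/ directory from its name."""
    name = dirname.lower()
    # An explicit architecture token in the name wins over the number.
    for token in ("blackwell", "hopper", "ampere"):
        if token in name:
            return token
    match = re.match(r"(\d+)_", name)
    if not match:
        return "unknown"
    number = int(match.group(1))
    # Fall back to the numeric ranges listed in the tree above.
    if number <= 47:
        return "ampere-or-earlier"
    if number <= 69:
        return "hopper"
    if number <= 93:
        return "blackwell"
    return "unknown"


print(example_arch("88_hopper_fmha"))     # hopper
print(example_arch("70_blackwell_gemm"))  # blackwell
print(example_arch("00_basic_gemm"))      # ampere-or-earlier
```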
CuTe C++ Headers
CUTLASS_REPO/include/cute/
├── layout.hpp # Layout core: Shape, Stride, composition
├── tensor.hpp # Tensor: make_tensor, local_tile, partition
├── swizzle.hpp # Swizzle patterns
├── algorithm/ # copy, gemm, fill, clear
├── arch/ # Architecture-specific: copy_sm90, mma_sm90, copy_sm100
├── atom/ # MMA atom and Copy atom definitions
│ ├── mma_atom.hpp
│ ├── copy_atom.hpp
│ └── mma_traits_sm90_gmma.hpp # WGMMA traits
├── numeric/ # Numeric types
└── container/ # tuple, array
CUTLASS C++ Headers
CUTLASS_REPO/include/cutlass/
├── gemm/ # GEMM device layer, collective, kernel
│ ├── collective/ # CollectiveMainloop, CollectiveEpilogue
│ ├── kernel/ # GemmUniversal
│ └── device/ # Device launch interface
├── epilogue/ # Epilogue: bias, activation, scaling
├── conv/ # Convolution
├── arch/ # MMA instruction wrappers (mma_sm90.h, mma_sm100.h)
├── pipeline/ # Pipeline: PipelineTmaAsync, PipelineAsync
├── experimental/ # Experimental APIs
└── detail/ # Internal implementation details
pycute (Python CuTe bindings)
CUTLASS_REPO/python/pycute/
├── layout.py # Layout, make_layout, complement, coalesce
├── int_tuple.py # IntTuple operations
├── swizzle.py # Swizzle
└── typing.py # Type definitions
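To make the Shape/Stride idea behind Layout concrete, here is a self-contained sketch of the coordinate-to-offset mapping for a flat layout. It does not import pycute; `flat_crd2idx` is a hypothetical simplified helper that ignores the nested tuples the real library handles.

```python
from typing import Sequence


def flat_crd2idx(coord: Sequence[int], shape: Sequence[int], stride: Sequence[int]) -> int:
    """Map a coordinate to a linear offset for a flat (non-nested) layout."""
    if not (len(coord) == len(shape) == len(stride)):
        raise ValueError("coord, shape, and stride must have the same rank")
    for c, s in zip(coord, shape):
        if not 0 <= c < s:
            raise IndexError("coordinate out of bounds")
    # The layout function is the inner product of coordinate and stride.
    return sum(c * d for c, d in zip(coord, stride))


# A 4x8 tile stored row-major: stride 8 between rows, 1 between columns.
row_major = flat_crd2idx((2, 3), shape=(4, 8), stride=(8, 1))
# The same tile stored column-major: stride 1 between rows, 4 between columns.
col_major = flat_crd2idx((2, 3), shape=(4, 8), stride=(1, 4))
print(row_major, col_major)  # 19 14
```

The same logical coordinate lands at different offsets depending only on the stride, which is why a Layout is described by both Shape and Stride.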
Search Strategy
Search with the Grep tool; do not load entire files.
CuTeDSL usage
CUTLASS_REPO="$HOME/.cursor/skills/cutlass-skill/repos/cutlass"
# Find CuTeDSL examples by their JIT and kernel decorators
rg "@(cute\.)?(jit|kernel)" "$CUTLASS_REPO/examples/python/CuTeDSL/"
# Find TiledMMA usage
rg "TiledMMA|tiled_mma" "$CUTLASS_REPO/python/CuTeDSL/cutlass/cute/"
# Find pipeline usage
rg "MainloopPipeline|PipelineAsync" "$CUTLASS_REPO/python/CuTeDSL/cutlass/pipeline/"
# Find Blackwell CuTeDSL examples
rg "sm_100|blackwell" "$CUTLASS_REPO/examples/python/CuTeDSL/blackwell/"
CuTe C++ usage
# Find Layout operations
rg "make_layout|composition|complement" "$CUTLASS_REPO/include/cute/layout.hpp"
# Find TiledCopy usage
rg "TiledCopy|make_tiled_copy" "$CUTLASS_REPO/include/cute/"
# Find MMA atom traits
rg "MMA_Traits" "$CUTLASS_REPO/include/cute/atom/"
# Find Hopper WGMMA
rg "SM90_64x" "$CUTLASS_REPO/include/cute/atom/mma_traits_sm90_gmma.hpp"
# Find TMA copy
rg "SM90_TMA" "$CUTLASS_REPO/include/cute/arch/"
CUTLASS Collective Builder
# Find CollectiveBuilder usage
rg "CollectiveBuilder" "$CUTLASS_REPO/examples/49_hopper_gemm_with_collective_builder/"
# Find Collective Mainloop
rg "CollectiveMainloop" "$CUTLASS_REPO/include/cutlass/gemm/collective/"
# Find epilogue fusion
rg "fusion|EVT" "$CUTLASS_REPO/include/cutlass/epilogue/"
# Find the kernel launch template
rg "GemmUniversal" "$CUTLASS_REPO/include/cutlass/gemm/device/"
GEMM example search
# Find FP8 GEMM configuration
rg "float_e4m3|float_e5m2|fp8" "$CUTLASS_REPO/examples/54_hopper_fp8_warp_specialized_gemm/"
# Find blockwise scaling
rg "blockwise|block_scale" "$CUTLASS_REPO/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/"
# Find grouped GEMM
rg "grouped|ProblemShape::Group" "$CUTLASS_REPO/examples/57_hopper_grouped_gemm/"
# Find sparse GEMM
rg "sparse|Sparse" "$CUTLASS_REPO/examples/62_hopper_sparse_gemm/"
# Find StreamK
rg "StreamK|stream_k" "$CUTLASS_REPO/examples/47_ampere_gemm_universal_streamk/"
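Where rg is not available, the same "search, do not load" approach can be approximated in Python. The sketch below streams files line by line and yields only matching lines; `grep_tree` is a hypothetical helper. It also illustrates a point about pattern syntax: in ripgrep's default regex engine, as in Python's `re`, alternation is a bare `|`, while an escaped `\|` matches a literal pipe character.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple


def grep_tree(root: Path, pattern: str, glob: str = "*") -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, line) for every line under root matching pattern."""
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        # Stream line by line so large headers are never read into memory whole.
        with path.open(encoding="utf-8", errors="ignore") as handle:
            for lineno, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path, lineno, line.rstrip("\n")


# A bare "|" is alternation; an escaped "\|" only matches a literal pipe.
sample = "tiled_mma = make_tiled_mma(op)"
print(bool(re.search(r"TiledMMA|tiled_mma", sample)))   # True
print(bool(re.search(r"TiledMMA\|tiled_mma", sample)))  # False
```

For example, `grep_tree(repo / "include" / "cute", r"TiledCopy|make_tiled_copy", "*.hpp")` would mirror the TiledCopy search above, assuming `repo` points at the CUTLASS checkout.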
When to Use Each Source
| Need | Source | Path |
|---|---|---|
| Getting started with CuTeDSL | CuTeDSL examples | examples/python/CuTeDSL/ampere/ |
| CuTeDSL Hopper GEMM | CuTeDSL examples | examples/python/CuTeDSL/hopper/ |
| CuTeDSL Blackwell GEMM | CuTeDSL examples | examples/python/CuTeDSL/blackwell/ |
| CuTeDSL API definitions | CuTeDSL source | python/CuTeDSL/cutlass/ |
| CuTe Layout semantics | CuTe headers | include/cute/layout.hpp |
| CuTe Tensor operations | CuTe headers | include/cute/tensor.hpp |
| MMA atom traits | CuTe atom | include/cute/atom/ |
| TMA copy architecture | CuTe arch | include/cute/arch/copy_sm90* |
| CUTLASS GEMM templates | CUTLASS examples | examples/48-93_* |
| Collective Builder | CUTLASS examples | examples/49_hopper_gemm_with_collective_builder/ |
| Collective Mainloop | CUTLASS headers | include/cutlass/gemm/collective/ |
| Epilogue fusion | CUTLASS headers | include/cutlass/epilogue/ |
| Pipeline abstractions | CUTLASS headers | include/cutlass/pipeline/ |
| pycute Layout operations | pycute | python/pycute/ |
| Hopper FMHA | CUTLASS examples | examples/88_hopper_fmha/ |
| Blackwell FMHA | CUTLASS examples | examples/77_blackwell_fmha/ |
| MoE GEMM | CUTLASS examples | examples/92_blackwell_moe_gemm/ |
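CuTeDSL authoring patterns
Basic elementwise kernel
import cutlass
import cutlass.cute as cute

# Schematic only: illustrates the decorator style. The shipped examples apply
# @cute.jit to host-side functions and @cute.kernel to device kernels.
@cute.jit
def add_kernel(x: cutlass.Float32, y: cutlass.Float32) -> cutlass.Float32:
    return x + y
See examples/python/CuTeDSL/ampere/elementwise_add.py for a complete example.
CuTeDSL GEMM
See examples/python/CuTeDSL/ampere/sgemm.py for a basic SGEMM.
See examples/python/CuTeDSL/hopper/ for Hopper TMA GEMM.
CuTe C++ GEMM patterns
See examples/cute/tutorial/sgemm_1.cu through sgemm_sm80.cu for a progressive CuTe SGEMM tutorial.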
Compilation Reference
# Build a CUTLASS example
cd "$CUTLASS_REPO" && mkdir -p build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=90a # Hopper
cmake --build . --target 49_hopper_gemm_with_collective_builder
# Build for specific architectures
cmake .. -DCUTLASS_NVCC_ARCHS="80;90a;100a"
# Run CuTeDSL (from the repository root, not build/)
cd "$CUTLASS_REPO"
pip install -e python/CuTeDSL/
python examples/python/CuTeDSL/ampere/sgemm.py
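When the configure step is driven from a script, the semicolon-separated CUTLASS_NVCC_ARCHS value is easy to mis-quote in a shell. A minimal sketch that builds the argument list instead, using only the architecture values shown above; `ARCH_FLAGS` and `cmake_configure_args` are hypothetical names.

```python
from typing import List, Sequence

# CUTLASS_NVCC_ARCHS values taken from the commands above.
ARCH_FLAGS = {"ampere": "80", "hopper": "90a", "blackwell": "100a"}


def cmake_configure_args(archs: Sequence[str], source_dir: str = "..") -> List[str]:
    """Build the argv list for the CMake configure step."""
    unknown = [a for a in archs if a not in ARCH_FLAGS]
    if unknown:
        raise ValueError(f"unknown architecture(s): {unknown}")
    # CMake list values are separated by semicolons.
    value = ";".join(ARCH_FLAGS[a] for a in archs)
    return ["cmake", source_dir, f"-DCUTLASS_NVCC_ARCHS={value}"]


print(cmake_configure_args(["ampere", "hopper", "blackwell"]))
# ['cmake', '..', '-DCUTLASS_NVCC_ARCHS=80;90a;100a']
```

Passing the resulting list to `subprocess.run(args, cwd=build_dir, check=True)` avoids shell quoting entirely, since no shell interprets the semicolons.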
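Troubleshooting common issues
| Problem | Likely cause | Where to look |
|---|---|---|
| GEMM accuracy is off | Epilogue not configured with the correct accumulator type | rg "ElementAccumulator" examples/49_* |
| TMA error | Tensor alignment requirement not met (128-bit, i.e. 16-byte, in the Hopper examples) | rg -e Alignment -e alignment examples/48_* |
| Collective Builder fails to compile | Architecture mismatch or incorrect pipeline configuration | rg -e KernelSchedule -e EpilogueSchedule examples/49_* |
| CuTeDSL type error | DSL type does not match the CUDA type | rg -e dtype -e element_type examples/python/CuTeDSL/ |
| Layout swizzle error | Swizzle pattern incompatible with the data layout | include/cute/swizzle.hpp |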
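For the alignment row, the Hopper examples typically express alignment in elements as `128 / cutlass::sizeof_bits<Element>::value`, that is, a 128-bit access divided by the element width. The sketch below reproduces that arithmetic so a problem size can be checked before launch; both function names are hypothetical, and the 128-bit default is an assumption taken from those examples.

```python
def alignment_in_elements(element_bits: int, access_bits: int = 128) -> int:
    """Elements per aligned access, mirroring 128 / sizeof_bits<Element>."""
    if element_bits <= 0 or access_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the access width")
    return access_bits // element_bits


def is_aligned(extent: int, element_bits: int, access_bits: int = 128) -> bool:
    """True if a contiguous extent (in elements) is a whole number of accesses."""
    return extent % alignment_in_elements(element_bits, access_bits) == 0


print(alignment_in_elements(16))  # 8 elements for a 16-bit type
print(alignment_in_elements(8))   # 16 elements for an 8-bit type
print(is_aligned(4100, 16))       # False: 4100 is not a multiple of 8
```

A contiguous dimension that fails this check is a common first thing to rule out when a TMA-based kernel rejects its arguments.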
Updating the CUTLASS source
# From the cursor-gpu-skills project directory
bash update-repos.sh cutlass
Additional References
- Official CUTLASS documentation: https://github.com/NVIDIA/cutlass
- CuTe documentation: CUTLASS_REPO/media/docs/cute/ (if installed with --full mode)
- CuTeDSL notebooks: examples/python/CuTeDSL/notebooks/