TileLang Developer
Write high-performance AI compute kernels using TileLang - a tile-based programming model that bridges the gap between CUDA's low-level control and high-level abstractions.
When to Use This Skill
Use this skill when the user needs to:
- Implement custom GPU kernels for AI operations (matrix multiplication, attention mechanisms, etc.)
- Optimize performance-critical operators for modern GPUs (NVIDIA Ampere/Hopper, AMD MI300X, Ascend NPU)
- Debug TileLang code or resolve performance issues
- Port kernels across different hardware platforms
- Understand or modify existing TileLang implementations
Kernel Development Workflow
Follow these steps when writing a TileLang kernel:
Step 1: Analyze Requirements
Gather essential information:
Input/Output Specifications:
- Tensor shapes (M, N, K dimensions)
- Data types (float16, float32, bfloat16, int8)
- Memory layout (row-major, column-major)
Hardware Target:
- NVIDIA GPU (Ampere A100, Hopper H100, etc.)
- AMD GPU (MI300X, etc.)
- Huawei Ascend NPU
Performance Goals:
- Target throughput or latency
- Memory bandwidth constraints
- Comparison baseline (cuBLAS, vendor libraries)
Ask clarifying questions if details are missing.
Step 2: Set Up Kernel Structure
Create the basic kernel scaffold:
import tilelang
import tilelang.language as T

@tilelang.jit(target="cuda", out_idx=[2])  # Specify output indices
def kernel_name(M, N, K, block_M, block_N, block_K):
    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16"),
    ):
        # Kernel logic will go here
        pass
    return main
Key decisions:
- target: "cuda" (NVIDIA), "hip" (AMD), or "cpu"
- out_idx: List indices of output parameters
- Block dimensions: Typical values are 64, 128, or 256
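The out_idx setting changes the call convention: parameters listed there are allocated by the TileLang runtime and returned, rather than passed in by the caller. A minimal call sketch under that assumption:

import torch

func = kernel_name(1024, 1024, 1024, 128, 128, 32)  # compile one instantiation
A = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
B = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")

# With out_idx=[2], C (parameter index 2) is allocated and returned:
C = func(A, B)
# Without out_idx, all three buffers would be passed explicitly: func(A, B, C)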
Step 3: Define Grid and Memory Hierarchy
Set up computation grid and allocate memory:
# Define grid dimensions
with T.Kernel(
    T.ceildiv(N, block_N),  # Grid X
    T.ceildiv(M, block_M),  # Grid Y
    threads=128
) as (bx, by):
    # Allocate shared memory (explicitly managed on-chip storage)
    A_shared = T.alloc_shared((block_M, block_K), "float16")
    B_shared = T.alloc_shared((block_K, block_N), "float16")

    # Allocate register fragments (accumulators)
    C_local = T.alloc_fragment((block_M, block_N), "float32")

    # CRITICAL: Apply swizzled layout to avoid bank conflicts
    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared),
        B_shared: T.make_swizzled_layout(B_shared),
    })
Memory hierarchy:
- Global memory (HBM): Input/output tensors, slowest
- Shared memory: Explicitly managed on-chip memory (shares SRAM with L1), up to ~164KB per SM on A100
- Registers: Fastest, used for accumulators and temporaries
Critical optimization: Always apply T.make_swizzled_layout to shared memory to eliminate bank conflicts.
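A quick way to sanity-check a block configuration against the shared memory budget is a back-of-the-envelope estimate. The sketch below assumes fp16 inputs (2 bytes per element) and that the pipeline keeps num_stages copies of each input tile resident; the compiler's exact allocation may differ slightly.

block_M, block_N, block_K, num_stages = 128, 128, 32, 3
bytes_per_stage = (block_M * block_K + block_K * block_N) * 2  # A tile + B tile, fp16
smem_bytes = bytes_per_stage * num_stages
print(f"~{smem_bytes / 1024:.0f} KB of ~164 KB available on A100")  # ~48 KB: fits comfortably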
Step 4: Implement Computation Logic
Use TileLang primitives for data movement and computation:
# Initialize accumulator
T.clear(C_local)

# Main computation loop with software pipelining
for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
    # Load tiles from global to shared memory
    T.copy(A[by * block_M, k * block_K], A_shared)
    T.copy(B[k * block_K, bx * block_N], B_shared)
    # Compute using Tensor Cores
    T.gemm(A_shared, B_shared, C_local, transpose_B=False)

# Write results back
T.copy(C_local, C[by * block_M, bx * block_N])
Key primitives:
- T.copy: Intelligent data transfer (auto-selects cp.async, TMA, etc.)
- T.gemm: Matrix multiplication using Tensor Cores
- T.Pipelined: Software pipelining to overlap compute and memory transfer
- T.Parallel: Element-wise parallel operations
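T.copy, T.gemm, and T.Pipelined all appear in the loop above; T.Parallel covers element-wise work such as epilogues. A small sketch of a fused bias-add plus ReLU on the accumulator tile (Bias is a hypothetical extra input buffer, not part of the kernel above):

# Inside the kernel body, after the K loop and before the final T.copy:
for i, j in T.Parallel(block_M, block_N):
    C_local[i, j] = T.max(C_local[i, j] + Bias[bx * block_N + j], 0)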
Pipeline stages:
- num_stages=2: Double buffering
- num_stages=3: Triple buffering (recommended for most workloads)
- num_stages=4+: Diminishing returns, increases shared memory usage
Step 5: Validate and Test
Generate test code to verify correctness:
# Example instantiation
func = kernel_name(
    M=1024, N=1024, K=1024,
    block_M=128, block_N=128, block_K=32,
)

# Test against reference implementation
import torch

A = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
B = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
C_reference = A @ B

# out_idx=[2] in the decorator means the kernel allocates and returns C
C_tilelang = func(A, B)

# Verify with appropriate tolerance for FP16
torch.testing.assert_close(C_tilelang, C_reference, rtol=1e-3, atol=1e-3)
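Once correctness is verified, capture a baseline before tuning (Step 6). The helper below is plain PyTorch CUDA-event timing, not a TileLang API, and reuses func, A, and B from the test above:

def benchmark(fn, *args, warmup=10, iters=100):
    for _ in range(warmup):            # Warm up: JIT compilation, caches
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

ms = benchmark(func, A, B)
tflops = (2 * 1024**3) / (ms * 1e-3) / 1e12  # 2*M*N*K FLOPs for this 1024^3 GEMM
print(f"{ms:.3f} ms/iter, {tflops:.2f} TFLOPS")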
Step 6: Optimize Performance
Apply advanced optimizations if performance is suboptimal:
Block Size Tuning:
- A100: Try 128×128×32 or 64×64×32
- H100: Can use larger blocks (256×128×32)
- MI300X: May need smaller blocks due to 64KB shared memory limit
Pipeline Depth:
- Increase num_stages if memory-bound
- Decrease num_stages if shared memory is exhausted
Warp Policy (for advanced cases):
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullRow) # For attention
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullCol) # For MLA decode
Block-level swizzle:
T.use_swizzle(panel_size=10) # Improves L2 cache hit rate
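These knobs interact, so a small sweep over candidate configurations is often the quickest way to choose them. Below is a manual sketch reusing kernel_name from Step 2 and the benchmark helper from Step 5; TileLang also ships autotuning utilities, whose exact API is not assumed here.

import itertools

best = None
for bM, bN, bK in itertools.product([64, 128, 256], [64, 128], [32, 64]):
    try:
        k = kernel_name(1024, 1024, 1024, bM, bN, bK)
        ms = benchmark(k, A, B)
    except Exception:
        continue  # Skip configs that fail (e.g. shared memory exceeded)
    if best is None or ms < best[0]:
        best = (ms, bM, bN, bK)

print("best (ms, block_M, block_N, block_K):", best)
# To sweep num_stages or warp policy as well, expose them as kernel_name parameters.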
Common Kernel Patterns
Matrix Multiplication (GEMM)
Most fundamental kernel. See EXAMPLES.md for complete implementation.
Key features:
- 3-stage pipelining
- Swizzle layout for shared memory
- Float32 accumulator for precision
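For quick orientation, the snippets from Steps 2-4 assemble into a single kernel along these lines (a sketch mirroring the workflow above; consult EXAMPLES.md for the maintained version):

import tilelang
import tilelang.language as T

@tilelang.jit(target="cuda", out_idx=[2])
def matmul(M, N, K, block_M, block_N, block_K):
    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16"),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), "float16")
            B_shared = T.alloc_shared((block_K, block_N), "float16")
            C_local = T.alloc_fragment((block_M, block_N), "float32")  # fp32 accumulator

            T.annotate_layout({
                A_shared: T.make_swizzled_layout(A_shared),
                B_shared: T.make_swizzled_layout(B_shared),
            })
            T.clear(C_local)

            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)

            T.copy(C_local, C[by * block_M, bx * block_N])
    return main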
FlashAttention
Memory-efficient attention with online softmax. See EXAMPLES.md for complete implementation.
Key features:
- O(N) memory complexity
- Online softmax statistics
- Fused kernel (no intermediate materialization)
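The heart of FlashAttention is the online softmax update: running max and sum statistics let each KV block be folded into the output without materializing the full score matrix. A plain PyTorch sketch of the per-block update (illustrative math, not the TileLang kernel):

import torch

def online_softmax_update(m, l, acc, scores, v_block):
    # m: running row max, l: running denominator, acc: running output numerator
    m_new = torch.maximum(m, scores.max(dim=-1).values)
    scale = torch.exp(m - m_new)                # Rescale previously accumulated stats
    p = torch.exp(scores - m_new[:, None])      # Unnormalized probs for this KV block
    l_new = l * scale + p.sum(dim=-1)
    acc_new = acc * scale[:, None] + p @ v_block
    return m_new, l_new, acc_new

# Initialize m = -inf, l = 0, acc = 0; after the last KV block: out = acc / l[:, None]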
DeepSeek MLA
Multi-Head Latent Attention with KV compression. See EXAMPLES.md for complete implementation.
Key features:
- Split-KV parallelization
- Non-standard dimensions
- FullCol warp policy for narrow matrices
Reference Documentation
When you need specific information:
- API details (parameters, signatures, options): Read API_REFERENCE.md
- Complete code examples (GEMM, Attention, MLA): Read EXAMPLES.md
- Troubleshooting (errors, performance issues): Read DEBUGGING.md
Critical Performance Guidelines
Always include these optimizations:
- Swizzle layout for shared memory:
  T.annotate_layout({A_shared: T.make_swizzled_layout(A_shared)})
- Software pipelining:
  for k in T.Pipelined(num_blocks, num_stages=3):
- Float32 accumulators:
  C_local = T.alloc_fragment((M, N), "float32")  # Not float16
- Aligned block_K:
  block_K = 32  # Or 16, must align for Tensor Core
- Initialize accumulators:
  T.clear(C_local)
Hardware-Specific Considerations
NVIDIA GPUs
- Ampere (A100): Use cp.async, num_stages=3, block_K=32
- Hopper (H100): Can use TMA, larger blocks (256×128), num_stages=4
- Shared memory: 164KB (A100), 228KB (H100)
AMD GPUs
- MI300X: Use target="hip", smaller blocks, 64KB shared memory limit
- Test with both HIP and CUDA backends for compatibility
Huawei Ascend
- More experimental backend
- May require specific block sizes
- Consult Ascend-specific documentation
Code Quality Standards
When generating TileLang code:
- Add explanatory comments for non-obvious optimizations
- Specify hardware assumptions (e.g., "optimized for A100")
- Include usage examples showing instantiation
- Document block size choices and tuning rationale
- Provide performance expectations (e.g., "~90% of cuBLAS")
Example Kernel Request Flow
User: "Write a FP16 matrix multiplication kernel for A100"
Response:
- Clarify dimensions (if not specified)
- Generate complete kernel code with:
  - Proper structure (@tilelang.jit, @T.prim_func)
  - Swizzle layouts
  - 3-stage pipelining
  - Appropriate block sizes (128×128×32)
- Add usage example
- Explain key optimizations:
  - "Swizzle layout eliminates bank conflicts"
  - "3-stage pipeline overlaps memory and compute"
  - "Float32 accumulator prevents overflow"
- Suggest testing approach
Troubleshooting Quick Reference
Compilation errors:
- Shared memory exceeded → Reduce block size or num_stages
- Shape mismatch → Verify dimension alignment in T.gemm
Runtime errors:
- Results all zeros → Check T.clear() and out_idx in decorator
- NaN/Inf → Use float32 accumulator, add epsilon in division
Performance issues:
- Low throughput → Verify swizzle layout and pipelining enabled
- Low occupancy → Reduce shared memory usage or block size
- Bank conflicts → Apply T.make_swizzled_layout
For detailed solutions, consult DEBUGGING.md.