SGLang Development

Source Code Locations

Local path of the SGLang source (cloned from GitHub by install.sh):

SGLANG_REPO: ~/.cursor/skills/sglang-skill/repos/sglang/

If this path does not exist, run:

# Run from the cursor-gpu-skills project directory
bash update-repos.sh sglang

Core Runtime (SRT)

SGLANG_REPO/python/sglang/srt/
├── layers/
│   ├── attention/          # Attention backends
│   │   ├── flashinfer_backend.py      # FlashInfer (default)
│   │   ├── flashinfer_mla_backend.py  # FlashInfer MLA (DeepSeek)
│   │   ├── cutlass_mla_backend.py     # CUTLASS MLA
│   │   ├── flashattention_backend.py  # FlashAttention
│   │   ├── triton_backend.py          # Triton attention
│   │   ├── flashmla_backend.py        # FlashMLA
│   │   ├── nsa_backend.py             # Native Sparse Attention
│   │   ├── tbo_backend.py             # TBO
│   │   ├── fla/                       # Flash Linear Attention
│   │   ├── triton_ops/                # Triton attention ops
│   │   └── wave_ops/                  # Wave attention ops
│   ├── moe/                # MoE routing and dispatch
│   ├── quantization/       # FP8, GPTQ, AWQ, Marlin, etc.
│   ├── deep_gemm_wrapper/  # DeepGEMM integration
│   └── utils/
├── models/                 # Model implementations (LLaMA, DeepSeek, Qwen, etc.)
│   └── deepseek_common/    # Shared DeepSeek V2/V3 components
├── managers/               # Scheduler, TokenizerManager, Detokenizer
├── mem_cache/              # KV cache, Radix cache
├── model_executor/         # Model executor, forward batch
├── model_loader/           # Model loading, weight mapping
├── entrypoints/            # Launch entrypoints: Engine, OpenAI API server
├── speculative/            # Speculative decoding
├── disaggregation/         # Disaggregated prefill/decode
├── distributed/            # TP/PP/EP distributed execution
├── compilation/            # CUDA Graph, torch.compile
├── configs/                # Model configs
├── lora/                   # LoRA inference
├── eplb/                   # Expert-level load balancing
├── hardware_backend/       # Hardware backends (CUDA, ROCm, XPU)
└── utils/                  # Utility functions

JIT Kernels (Python CUDA/Triton Kernels)

SGLANG_REPO/python/sglang/jit_kernel/
├── flash_attention/        # Custom Flash Attention implementations
├── flash_attention_v4.py   # Flash Attention v4
├── cutedsl_gdn.py          # CuTeDSL GDN kernel
├── concat_mla.py           # MLA concat kernel
├── norm.py                 # Normalization kernels
├── rope.py                 # RoPE position encoding
├── pos_enc.py              # Position encoding
├── per_tensor_quant_fp8.py # FP8 quantization
├── kvcache.py              # KV cache kernels
├── hicache.py              # HiCache kernels
├── gptq_marlin.py          # GPTQ Marlin kernel
├── cuda_wait_value.py      # CUDA sync primitives
└── diffusion/              # Diffusion model kernels

sgl-kernel (C++/CUDA Custom Kernels)

SGLANG_REPO/sgl-kernel/
├── csrc/
│   ├── attention/          # Custom attention CUDA kernels
│   ├── cutlass_extensions/ # CUTLASS GEMM extensions
│   ├── gemm/               # GEMM kernels
│   ├── moe/                # MoE dispatch/combine kernels
│   ├── quantization/       # Quantization CUDA kernels
│   ├── allreduce/          # AllReduce CUDA kernels
│   ├── speculative/        # Speculative decoding kernels
│   ├── kvcacheio/          # KV cache I/O
│   ├── mamba/              # Mamba SSM kernels
│   ├── memory/             # Memory management
│   └── grammar/            # Grammar-guided generation
├── include/                # C++ headers
├── python/                 # Python bindings
├── tests/                  # Kernel tests
└── benchmark/              # Kernel benchmarks

Frontend Language

SGLANG_REPO/python/sglang/lang/   # SGLang frontend DSL
SGLANG_REPO/examples/             # Usage examples
SGLANG_REPO/benchmark/            # Performance benchmarks
SGLANG_REPO/test/                 # Test suite
SGLANG_REPO/docs/                 # Documentation

Search Strategy

Search with the Grep tool; do not load entire files.

Attention and MLA

SGLANG_REPO="$HOME/.cursor/skills/sglang-skill/repos/sglang"

# Find attention backend registration
rg "register|Backend" $SGLANG_REPO/python/sglang/srt/layers/attention/attention_registry.py

# Find the FlashInfer MLA implementation
rg "forward|mla" $SGLANG_REPO/python/sglang/srt/layers/attention/flashinfer_mla_backend.py

# Find CUTLASS MLA
rg "cutlass|mla" $SGLANG_REPO/python/sglang/srt/layers/attention/cutlass_mla_backend.py

# Find the common attention interface
rg "class.*Backend|def forward" $SGLANG_REPO/python/sglang/srt/layers/attention/base_attn_backend.py

Scheduler and Batching

# Scheduler core logic
rg "class Scheduler|def get_next_batch" $SGLANG_REPO/python/sglang/srt/managers/

# Continuous batching and chunked prefill
rg "chunk|prefill|extend" $SGLANG_REPO/python/sglang/srt/managers/

# CUDA Graph
rg "cuda_graph|CudaGraph" $SGLANG_REPO/python/sglang/srt/compilation/
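
The chunked prefill searched for above splits a long prompt into fixed-size pieces so prefill work can be interleaved with decode steps. A minimal conceptual sketch in plain Python (illustrative only — not SGLang's actual Scheduler code; the function name and token lists are made up):

```python
def chunk_prefill(prompt_tokens, chunk_size):
    """Split a long prompt into fixed-size chunks so prefill can be
    interleaved with decode steps during continuous batching."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

# A 10-token prompt prefilled in chunks of 4 tokens:
chunks = chunk_prefill(list(range(10)), chunk_size=4)
```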

KV Cache and Memory

# Radix cache implementation
rg "RadixCache|radix" $SGLANG_REPO/python/sglang/srt/mem_cache/

# KV cache management
rg "class.*Pool|allocate|free" $SGLANG_REPO/python/sglang/srt/mem_cache/

# HiCache (hierarchical cache)
rg "HiCache|hicache" $SGLANG_REPO/python/sglang/srt/mem_cache/
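
Conceptually, the radix cache lets requests that share a prompt prefix reuse each other's KV entries. A toy sketch of the idea (not the real RadixCache API — the class and method names here are illustrative):

```python
class ToyRadixCache:
    """Token-level prefix cache: stores token sequences and reports how
    many leading tokens of a new request are already cached."""

    def __init__(self):
        self.cached = []          # list of cached token sequences

    def insert(self, tokens):
        self.cached.append(list(tokens))

    def match_prefix(self, tokens):
        best = 0
        for seq in self.cached:
            n = 0
            while n < min(len(seq), len(tokens)) and seq[n] == tokens[n]:
                n += 1
            best = max(best, n)
        return best               # KV for these leading tokens can be reused

cache = ToyRadixCache()
cache.insert([1, 2, 3, 4])
hit = cache.match_prefix([1, 2, 3, 9, 9])   # first 3 tokens hit the cache
```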

Models

# Find a specific model implementation
rg "class.*ForCausalLM" $SGLANG_REPO/python/sglang/srt/models/

# DeepSeek V2/V3 implementation
rg "DeepSeek|MLA|MoE" $SGLANG_REPO/python/sglang/srt/models/deepseek_v2.py

# Model loading and weight mapping
rg "load_weight|weight_map" $SGLANG_REPO/python/sglang/srt/model_loader/

MoE

# MoE routing
rg "TopK|router|expert" $SGLANG_REPO/python/sglang/srt/layers/moe/

# MoE CUDA kernels
rg "moe" $SGLANG_REPO/sgl-kernel/csrc/moe/
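
The TopK routing searched for above picks a few experts per token and renormalizes their gate weights. A pure-Python sketch of the common scheme (illustrative only — the real implementation in srt/layers/moe/ operates on tensors):

```python
import math

def topk_route(logits, k):
    """Pick the top-k experts for one token by softmax probability and
    renormalize their weights to sum to 1 (common TopK routing scheme)."""
    probs = [math.exp(x) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    experts = sorted(range(len(logits)), key=lambda i: -probs[i])[:k]
    weight_sum = sum(probs[i] for i in experts)
    weights = {i: probs[i] / weight_sum for i in experts}
    return experts, weights

# 4 experts, route to the top 2:
experts, weights = topk_route([2.0, 0.5, 1.0, -1.0], k=2)
```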

Quantization

# FP8 quantization
rg "fp8|float8" $SGLANG_REPO/python/sglang/srt/layers/quantization/

# GPTQ/AWQ/Marlin
rg "gptq|awq|marlin" $SGLANG_REPO/python/sglang/srt/layers/quantization/
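
Per-tensor FP8 quantization (as in per_tensor_quant_fp8.py) scales a tensor so its absolute maximum lands at the format's largest representable value. A sketch of the scaling math in plain Python (448 is E4M3's max representable value; rounding to the actual FP8 grid is omitted here):

```python
def per_tensor_quant(values, max_repr=448.0):
    """Per-tensor scaling: scale maps the tensor's amax onto the FP8
    format's range, then values are clamped to that range."""
    amax = max(abs(v) for v in values)
    scale = amax / max_repr if amax > 0 else 1.0
    quant = [max(-max_repr, min(max_repr, v / scale)) for v in values]
    return quant, scale

quant, scale = per_tensor_quant([0.5, -2.0, 1.0])
dequant = [q * scale for q in quant]   # recovers the originals (no rounding)
```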

Speculative Decoding

rg "speculative|draft|verify" $SGLANG_REPO/python/sglang/srt/speculative/
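
The draft/verify loop searched for above works by accepting draft tokens until the target model disagrees. A sketch of greedy verification (conceptual only — SGLang's verification handles sampling, trees, and batches):

```python
def verify_draft(draft_tokens, target_argmax):
    """Greedy verification: accept draft tokens while they match the
    target model's argmax at each position; on the first mismatch, emit
    the target's token instead and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)    # correction from the target model
            break
    return accepted

# Draft proposes 4 tokens; the target disagrees at position 2:
out = verify_draft([5, 7, 9, 2], [5, 7, 8, 2])
```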

Distributed

# TP/PP/EP
rg "tensor_parallel|pipeline_parallel|expert_parallel" $SGLANG_REPO/python/sglang/srt/distributed/

# Disaggregated serving
rg "disagg|prefill_worker|decode_worker" $SGLANG_REPO/python/sglang/srt/disaggregation/

When to Use Each Source

Need                         Source            Path
Attention backend interface  SRT layers        srt/layers/attention/base_attn_backend.py
FlashInfer attention         SRT layers        srt/layers/attention/flashinfer_backend.py
MLA (DeepSeek)               SRT layers        srt/layers/attention/*mla*.py
MoE routing/dispatch         SRT layers        srt/layers/moe/
Quantization (FP8/GPTQ/AWQ)  SRT layers        srt/layers/quantization/
Scheduler                    SRT managers      srt/managers/
KV cache / Radix cache       SRT mem_cache     srt/mem_cache/
Model implementations        SRT models        srt/models/
DeepSeek V2/V3               SRT models        srt/models/deepseek_v2.py, deepseek_common/
Speculative decoding         SRT speculative   srt/speculative/
Disaggregated serving        SRT disagg        srt/disaggregation/
TP/PP/EP distributed         SRT distributed   srt/distributed/
CUDA Graph                   SRT compilation   srt/compilation/
Model loading                SRT model_loader  srt/model_loader/
Launch entrypoints           SRT entrypoints   srt/entrypoints/
JIT Triton kernels           jit_kernel        jit_kernel/
Custom CUDA kernels          sgl-kernel        sgl-kernel/csrc/
CUTLASS extensions           sgl-kernel        sgl-kernel/csrc/cutlass_extensions/
Frontend DSL                 lang              python/sglang/lang/
Usage examples               examples          examples/

Common Development Tasks

Adding a New Attention Backend

  1. Subclass AttnBackend from base_attn_backend.py
  2. Implement the forward() method
  3. Register it in attention_registry.py
  4. Use flashinfer_backend.py as a template
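
The steps above can be sketched as a skeleton. Note the base-class and method signatures here are illustrative stand-ins, not SGLang's exact API — check base_attn_backend.py for the real interface:

```python
class BaseAttnBackend:
    """Stand-in for the base class defined in base_attn_backend.py."""

    def forward(self, q, k, v, metadata):
        raise NotImplementedError

class MyAttnBackend(BaseAttnBackend):
    """A new backend: implement forward() on top of your kernel, then
    register the class in attention_registry.py."""

    def forward(self, q, k, v, metadata):
        # Call into your custom attention kernel here; this stub just
        # echoes its inputs to show the shape of the interface.
        return {"q": q, "k": k, "v": v}

backend = MyAttnBackend()
result = backend.forward("Q", "K", "V", metadata=None)
```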

Adding a New Model

  1. Create a model file under srt/models/
  2. Implement the ForCausalLM class
  3. Implement the load_weights() method
  4. Use srt/models/llama.py as a template
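
The load_weights() step typically maps checkpoint parameter names onto the model's own names and copies tensors over. A sketch of that pattern with plain dicts (names and the helper are illustrative; real implementations iterate torch tensors and handle sharding):

```python
def load_weights_sketch(state, checkpoint_weights, name_map):
    """Typical load_weights() pattern: translate checkpoint parameter
    names to the model's names and copy each tensor into place."""
    for ckpt_name, tensor in checkpoint_weights:
        model_name = name_map.get(ckpt_name, ckpt_name)
        if model_name in state:
            state[model_name] = tensor

# Hypothetical example: a GPT-2-style checkpoint name mapped onto the model.
state = {"model.embed.weight": None}
weights = [("transformer.wte.weight", [1, 2, 3])]
load_weights_sketch(state, weights,
                    {"transformer.wte.weight": "model.embed.weight"})
```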

Adding a New Quantization Method

  1. Add a quantization module under srt/layers/quantization/
  2. Register it with the quantization factory
  3. Refer to fp8_kernel.py and gptq.py

Launching and Debugging

# Launch the OpenAI-compatible API server
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 1

# Use the Engine API (Python)
from sglang import Engine
engine = Engine(model_path="meta-llama/Meta-Llama-3-8B-Instruct")

# Profiling
python -m sglang.launch_server --model-path ... --enable-torch-compile
nsys profile -o report python -m sglang.launch_server ...

Updating the SGLang Source

# From the cursor-gpu-skills project directory
bash update-repos.sh sglang
