sglang-skill
SKILL.md
SGLang Development
Source Code Locations
SGLang 源码本地路径(由 install.sh 从 GitHub clone):
SGLANG_REPO: ~/.cursor/skills/sglang-skill/repos/sglang/
如果该路径不存在,运行:
# 在 cursor-gpu-skills 项目目录下运行
bash update-repos.sh sglang
Core Runtime (SRT)
SGLANG_REPO/python/sglang/srt/
├── layers/
│ ├── attention/ # Attention backends
│ │ ├── flashinfer_backend.py # FlashInfer (默认)
│ │ ├── flashinfer_mla_backend.py # FlashInfer MLA (DeepSeek)
│ │ ├── cutlass_mla_backend.py # CUTLASS MLA
│ │ ├── flashattention_backend.py # FlashAttention
│ │ ├── triton_backend.py # Triton attention
│ │ ├── flashmla_backend.py # FlashMLA
│ │ ├── nsa_backend.py # Native Sparse Attention
│ │ ├── tbo_backend.py # TBO
│ │ ├── fla/ # Flash Linear Attention
│ │ ├── triton_ops/ # Triton attention ops
│ │ └── wave_ops/ # Wave attention ops
│ ├── moe/ # MoE routing and dispatch
│ ├── quantization/ # FP8, GPTQ, AWQ, Marlin, etc.
│ ├── deep_gemm_wrapper/ # DeepGEMM 集成
│ └── utils/
├── models/ # 模型实现 (LLaMA, DeepSeek, Qwen, etc.)
│ └── deepseek_common/ # DeepSeek V2/V3 共享组件
├── managers/ # Scheduler, TokenizerManager, Detokenizer
├── mem_cache/ # KV cache, Radix cache
├── model_executor/ # 模型执行器, forward batch
├── model_loader/ # 模型加载, 权重映射
├── entrypoints/ # 启动入口: Engine, OpenAI API server
├── speculative/ # Speculative decoding
├── disaggregation/ # Disaggregated prefill/decode
├── distributed/ # TP/PP/EP 分布式
├── compilation/ # CUDA Graph, Torch.compile
├── configs/ # 模型配置
├── lora/ # LoRA 推理
├── eplb/ # Expert-level load balancing
├── hardware_backend/ # 硬件适配 (CUDA, ROCm, XPU)
└── utils/ # 工具函数
JIT Kernels (Python CUDA/Triton Kernels)
SGLANG_REPO/python/sglang/jit_kernel/
├── flash_attention/ # Flash Attention 自定义实现
├── flash_attention_v4.py # Flash Attention v4
├── cutedsl_gdn.py # CuTeDSL GDN kernel
├── concat_mla.py # MLA concat kernel
├── norm.py # Normalization kernels
├── rope.py # RoPE position encoding
├── pos_enc.py # Position encoding
├── per_tensor_quant_fp8.py # FP8 量化
├── kvcache.py # KV cache kernels
├── hicache.py # HiCache kernels
├── gptq_marlin.py # GPTQ Marlin kernel
├── cuda_wait_value.py # CUDA sync primitives
└── diffusion/ # Diffusion model kernels
sgl-kernel (C++/CUDA Custom Kernels)
SGLANG_REPO/sgl-kernel/
├── csrc/
│ ├── attention/ # Custom attention CUDA kernels
│ ├── cutlass_extensions/ # CUTLASS GEMM extensions
│ ├── gemm/ # GEMM kernels
│ ├── moe/ # MoE dispatch/combine kernels
│ ├── quantization/ # Quantization CUDA kernels
│ ├── allreduce/ # AllReduce CUDA kernels
│ ├── speculative/ # Speculative decoding kernels
│ ├── kvcacheio/ # KV cache I/O
│ ├── mamba/ # Mamba SSM kernels
│ ├── memory/ # Memory management
│ └── grammar/ # Grammar-guided generation
├── include/ # C++ headers
├── python/ # Python bindings
├── tests/ # Kernel tests
└── benchmark/ # Kernel benchmarks
Frontend Language
SGLANG_REPO/python/sglang/lang/ # SGLang 前端 DSL
SGLANG_REPO/examples/ # 使用示例
SGLANG_REPO/benchmark/ # 性能基准
SGLANG_REPO/test/ # 测试套件
SGLANG_REPO/docs/ # 文档
Search Strategy
用 Grep 工具搜索,不要整文件加载。
Attention 和 MLA
SGLANG_REPO="$HOME/.cursor/skills/sglang-skill/repos/sglang"
# 查找 attention backend 注册
rg "register\|Backend" $SGLANG_REPO/python/sglang/srt/layers/attention/attention_registry.py
# 查找 FlashInfer MLA 实现
rg "forward\|mla" $SGLANG_REPO/python/sglang/srt/layers/attention/flashinfer_mla_backend.py
# 查找 CUTLASS MLA
rg "cutlass\|mla" $SGLANG_REPO/python/sglang/srt/layers/attention/cutlass_mla_backend.py
# 查找 attention 通用接口
rg "class.*Backend\|def forward" $SGLANG_REPO/python/sglang/srt/layers/attention/base_attn_backend.py
Scheduler 和 Batching
# Scheduler 核心逻辑
rg "class Scheduler\|def get_next_batch" $SGLANG_REPO/python/sglang/srt/managers/
# Continuous batching 和 chunked prefill
rg "chunk\|prefill\|extend" $SGLANG_REPO/python/sglang/srt/managers/
# CUDA Graph
rg "cuda_graph\|CudaGraph" $SGLANG_REPO/python/sglang/srt/compilation/
KV Cache 和 Memory
# Radix cache 实现
rg "RadixCache\|radix" $SGLANG_REPO/python/sglang/srt/mem_cache/
# KV cache 管理
rg "class.*Pool\|allocate\|free" $SGLANG_REPO/python/sglang/srt/mem_cache/
# HiCache (hierarchical cache)
rg "HiCache\|hicache" $SGLANG_REPO/python/sglang/srt/mem_cache/
模型相关
# 查找特定模型实现
rg "class.*ForCausalLM" $SGLANG_REPO/python/sglang/srt/models/
# DeepSeek V2/V3 实现
rg "DeepSeek\|MLA\|MoE" $SGLANG_REPO/python/sglang/srt/models/deepseek_v2.py
# 模型加载和权重映射
rg "load_weight\|weight_map" $SGLANG_REPO/python/sglang/srt/model_loader/
MoE
# MoE routing
rg "TopK\|router\|expert" $SGLANG_REPO/python/sglang/srt/layers/moe/
# MoE CUDA kernels
rg "moe" $SGLANG_REPO/sgl-kernel/csrc/moe/
量化
# FP8 量化
rg "fp8\|float8" $SGLANG_REPO/python/sglang/srt/layers/quantization/
# GPTQ/AWQ/Marlin
rg "gptq\|awq\|marlin" $SGLANG_REPO/python/sglang/srt/layers/quantization/
Speculative Decoding
rg "speculative\|draft\|verify" $SGLANG_REPO/python/sglang/srt/speculative/
分布式
# TP/PP/EP
rg "tensor_parallel\|pipeline_parallel\|expert_parallel" $SGLANG_REPO/python/sglang/srt/distributed/
# Disaggregated serving
rg "disagg\|prefill_worker\|decode_worker" $SGLANG_REPO/python/sglang/srt/disaggregation/
When to Use Each Source
| Need | Source | Path |
|---|---|---|
| Attention backend 接口 | SRT layers | srt/layers/attention/base_attn_backend.py |
| FlashInfer attention | SRT layers | srt/layers/attention/flashinfer_backend.py |
| MLA (DeepSeek) | SRT layers | srt/layers/attention/*mla*.py |
| MoE routing/dispatch | SRT layers | srt/layers/moe/ |
| 量化 (FP8/GPTQ/AWQ) | SRT layers | srt/layers/quantization/ |
| Scheduler | SRT managers | srt/managers/ |
| KV cache / Radix cache | SRT mem_cache | srt/mem_cache/ |
| 模型实现 | SRT models | srt/models/ |
| DeepSeek V2/V3 | SRT models | srt/models/deepseek_v2.py, deepseek_common/ |
| Speculative decoding | SRT speculative | srt/speculative/ |
| Disaggregated serving | SRT disagg | srt/disaggregation/ |
| TP/PP/EP 分布式 | SRT distributed | srt/distributed/ |
| CUDA Graph | SRT compilation | srt/compilation/ |
| 模型加载 | SRT model_loader | srt/model_loader/ |
| 启动入口 | SRT entrypoints | srt/entrypoints/ |
| JIT Triton kernels | jit_kernel | jit_kernel/ |
| Custom CUDA kernels | sgl-kernel | sgl-kernel/csrc/ |
| CUTLASS extensions | sgl-kernel | sgl-kernel/csrc/cutlass_extensions/ |
| 前端 DSL | lang | python/sglang/lang/ |
| 使用示例 | examples | examples/ |
常见开发场景
添加新 Attention Backend
- 继承
base_attn_backend.py中的AttnBackend - 实现
forward()方法 - 在
attention_registry.py注册 - 参考
flashinfer_backend.py作为模板
添加新模型
- 在
srt/models/创建模型文件 - 实现
ForCausalLM类 - 实现
load_weights()方法 - 参考
srt/models/llama.py作为模板
添加新量化方法
- 在
srt/layers/quantization/添加量化模块 - 注册到量化工厂
- 参考
fp8_kernel.py或gptq.py
启动和调试
# 启动 OpenAI 兼容 API server
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 1
# 使用 Engine API (Python)
from sglang import Engine
engine = Engine(model_path="meta-llama/Meta-Llama-3-8B-Instruct")
# Profiling
python -m sglang.launch_server --model-path ... --enable-torch-compile
nsys profile -o report python -m sglang.launch_server ...
更新 SGLang 源码
# 在 cursor-gpu-skills 项目目录下
bash update-repos.sh sglang
Additional References
- SGLang 文档: https://docs.sglang.ai/
- GitHub: https://github.com/sgl-project/sglang
Weekly Installs
1
Repository
slowlyc/cursor-…u-skillsGitHub Stars
48
First Seen
14 days ago
Security Audits
Installed on
cursor1