sglang-deepseek-v32-optimization
# SGLang DeepSeek V3.2 Optimization

## Overview
This skill covers the DeepSeek V3.2 support and optimization ladder active in SGLang main. V3.2 shares the DeepSeek V3/R1 model backbone, but it is a separate optimization problem because it activates DeepSeek Sparse Attention, called DSA in docs and NSA in SGLang code.
Current-main snapshot:
- SGLang `origin/main`: `929e00eea` on 2026-04-21
- sgl-cookbook `origin/main`: `8ec4d03` on 2026-04-21
- V3.2 runtime entry: `DeepseekV32ForCausalLM` in `python/sglang/srt/models/deepseek_v2.py`
- NSA backend: `python/sglang/srt/layers/attention/nsa_backend.py`
- NSA indexer: `python/sglang/srt/layers/attention/nsa/nsa_indexer.py`
- V3.2 tool parser: `python/sglang/srt/function_call/deepseekv32_detector.py`
The historical evidence lives in:
- `references/pr-history.md`: chronological PR evidence and code-level notes
- `references/playbook.md`: investigation order, symptom mapping, validation commands
## Before You Change Anything
Record the exact serving shape first:
- model: V3.2-Exp, V3.2, V3.2-Speciale, V3.2-NVFP4, or V3.2-MXFP4
- whether `is_deepseek_nsa(config)` is true
- `--attention-backend`, `--nsa-prefill-backend`, `--nsa-decode-backend`
- KV cache dtype: `auto`, `bfloat16`, `fp8_e4m3`, or experimental FP4 tracks
- TP / DP / EP / PP / PD topology
- `--enable-dp-attention`
- `--enable-nsa-prefill-context-parallel`
- `--nsa-prefill-cp-mode`: `round-robin-split` or `in-seq-split`
- MTP enabled or not
- IndexCache knobs: `index_topk_freq`, `index_topk_pattern`
- tool parser: V3.2-Exp may use `deepseekv31` in the cookbook path, standard V3.2 uses `deepseekv32`
- reasoning parser: `--reasoning-parser deepseek-v3`
- hardware: H200, B200/GB200/GB300, AMD MI300/MI355, NPU, or another backend
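A minimal sketch of one recorded shape; the keys mirror the checklist above, and every value is an example for a single hypothetical deployment, not a default:

```python
# Example serving-shape record for a hypothetical H200 TP-8 deployment.
# Keys follow the checklist above; values are illustrative only.
serving_shape = {
    "model": "deepseek-ai/DeepSeek-V3.2-Exp",
    "is_deepseek_nsa": True,
    "attention_backend": "nsa",
    "nsa_prefill_backend": "flashmla_sparse",
    "nsa_decode_backend": "flashmla_kv",
    "kv_cache_dtype": "bfloat16",
    "topology": {"tp": 8, "dp": 1, "ep": 1, "pp": 1, "pd_disaggregation": False},
    "enable_dp_attention": False,
    "enable_nsa_prefill_context_parallel": False,
    "nsa_prefill_cp_mode": "round-robin-split",
    "mtp_enabled": False,
    "index_topk_freq": None,
    "index_topk_pattern": None,
    "tool_call_parser": "deepseekv32",
    "reasoning_parser": "deepseek-v3",
    "hardware": "H200",
}
print(serving_shape)
```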
## Core Principle
Do not treat V3.2 as ordinary DeepSeek V3.
- V3.2 turns on DSA/NSA through `is_deepseek_nsa(config)`.
- The attention hot path is split between the indexer, top-k transform, sparse MLA backend, and KV-cache quant/dequant.
- Server defaults are model-specific: attention backend becomes `nsa`, KV cache dtype defaults differ by architecture, and NSA prefill/decode backends are auto-selected.
- Context parallel is experimental and has strict mode-specific constraints.
- MTP spans the NextN layer, NSA metadata, target_verify, draft_extend, CP positions, and speculative overlap.
- V3.2 parser behavior is DSML for standard V3.2, while V3.2-Exp docs still point at the V3.1-style parser path.
The optimization order matters:
1. confirm DSA detection and server defaults
2. confirm KV cache dtype and NSA backend pair
3. validate indexer top-k generation and transform
4. validate MTP, CP, PP, or DP attention only after base DSA is correct
5. tune backend-specific kernels for Blackwell, Hopper, AMD, or NPU
6. add model-backed tests for any IndexCache, MTP, CP, or backend change
## Main Runtime Surfaces
Start from these files before changing behavior:
- `python/sglang/srt/models/deepseek_v2.py`
- `python/sglang/srt/models/deepseek_nextn.py`
- `python/sglang/srt/configs/model_config.py`
- `python/sglang/srt/server_args.py`
- `python/sglang/srt/managers/schedule_batch.py`
- `python/sglang/srt/managers/scheduler_output_processor_mixin.py`
- `python/sglang/srt/mem_cache/common.py`
- `python/sglang/srt/speculative/eagle_worker_v2.py`
- `python/sglang/srt/speculative/multi_layer_eagle_worker_v2.py`
- `python/sglang/srt/layers/attention/nsa_backend.py`
- `python/sglang/srt/layers/attention/nsa/nsa_indexer.py`
- `python/sglang/srt/layers/attention/nsa/utils.py`
- `python/sglang/srt/layers/attention/nsa/transform_index.py`
- `python/sglang/srt/layers/attention/nsa/quant_k_cache.py`
- `python/sglang/srt/layers/attention/nsa/dequant_k_cache.py`
- `python/sglang/srt/layers/communicator_nsa_cp.py`
- `python/sglang/srt/function_call/deepseekv32_detector.py`
- `examples/chat_template/tool_chat_template_deepseekv32.jinja`
## Open PRs to Track
Check these before declaring a V3.2 gap:
- #11191: sparse attention and CPU/GPU KV scheduling for GQA/DSA, open.
- #12820: TP-SP for Qwen and DeepSeek V2/V3/V3.2, open.
- #16148: V3.2 W4AFP8 MTP with FP8 draft model, open.
- #17185: TP `o_proj` linear in context-parallel NSA, open.
- #17761: missing Assistant token after V3.1/V3.2 tool output, open.
- #18167: DCP support for V3.2, open.
- #18275: NPU all-gather after qlora for V3.2, open.
- #18733: V3.2 PD disaggregation test, open.
- #19211: extract V3.2/NSA logic into `DeepseekV32Mixin`, open.
- #19299: O(1) expert weight matching in DeepSeek weight loader, open.
- #19609: TP indexer weight in NSA attention, open.
- #19975: AMD context parallel for V3.2, open.
- #20360: AMD CP round-robin split garbage output, open.
- #20531: NSA indexer ragged gather mismatch in CP round-robin split, open.
- #20809: add `DeepseekV32ForCausalLM` to MTP draft mapping, open.
- #20880: reject HiCache L3 for NSA models, open.
- #21179: preserve V3.2 tool-call markers in reasoning parsing, open.
- #21194: AMD `PPMissingLayer` fix in DeepSeek AITER gfx95 path, open.
- #21506: V3.2 NPU torch compile, open.
- #21529: ROCm MXFP4 / Quark W4A4 support for DeepSeek architecture, open.
- #21530: ROCm fused MLA decode RoPE fix for DeepSeek-variant models, open.
- #21546: catch malformed JSON in V3.2 partial function-call parsing, open.
- #21889: AMD FP4 KV cache quantization for NSA TileLang, open.
- #22268: DeepSeek MLA LoRA adapter bypass in `prepare_qkv_latent`, open.
- #22473: dense MLA decode fallback for short sequences, open.
- #22774: MUSA backend support for DeepSeek V2/V3/R1-class layers, open.
- #22851: `--nsa-topk-backend` and FlashInfer/PyTorch top-k, open.
- #22865: sparsity framework extension for non-NSA sparse algorithms, open.
- #14332: V3.2 tool-call parsing without DSML tag, open.
- #14524: NSA backend test suite, open.
- #15322: V3.2 `o_proj` TP support, open.
- #18094: V3.2 piecewise CUDA graph, open and related to #23351.
- #18542: EAGLE3 plus NSA CP aux-hidden-state index bug, open.
- #19987: AMD FP8 KV cache for TileLang NSA backend, open.
- #20534: transfer FP8 K/K-scale for CP indexer prefill gather, open.
- #21623: unit tests for `encoding_dsv32.py`, open.
- #22792: AITER `indexer_k_quant_and_cache`, open.
- #23268: NPU accuracy fix for NSA CP plus prefix cache, open.
- #22938: restore DeepSeek MLA MI300X paths after the MLA refactor, open.
- #23195: guard `.weight` access in DeepSeek MLA for AWQ/compressed-tensors, open.
- #23241: 3FS backend for DSA/mamba, open.
- #23257: CuteDSL EP plus DP-attention double-reduce fix in `DeepseekV2MoE`, open.
- #23336: adaptive speculative-num-steps support for spec v2 EAGLE workers, open.
- #23351: piecewise CUDA graph with NSA, open.
## Additional PR Coverage
Additional PR coverage across all states (open, merged, and closed) includes V3.2 bugfixes, closed experiments, tool-parser updates, and platform-specific backend work:
- Early bring-up polish: #11063, #11194, #11308, #11309, #11450, #11557, #11565, #11682, #11815, and #11835.
- Short-sequence MHA / Indexer fixes: #11892, #12094, #12582, #12583, #12645, #12788, #12816, #12964, #13022, #13459, and #13544.
- DSML/tool/parser path: #14304, #14307, #14353, #14573, #14750, #15064, #15278, #16091, #18126, #18174, and #17951.
- NSA backend / metadata / sparse-cache work: #14781, #14901, #15040, #15086, #15242, #15429, #16520, #16758, #16841, #17205, #17554, and #18319.
- HiSparse/HiCache and platform fixes: #14741, #17409, #17518, #17523, #17633, #18297, #18526, #20343, #21932, and #22238.
- Closed or superseded experiments to cite as history, not current support: #11109, #11596, #11761, #12017, #12052, #13531, #13546, #14619, #14904, #15051, #15217, #15310, #15807, #16079, #16881, #17024, #17199, #17310, and #17647.
- Round-2 runtime additions: #21249 adds all-reduce fusion with context parallel, #22003 relaxes `moe_dp_size == 1` with different `attention_cp_size` values, #21599 adds adaptive EAGLE top-k=1 draft steps, #22128 allows PCG with speculative decoding, #23219 touches shared DSA/NextN infrastructure through `deepseek_nextn.py`, #22950 is the closed predecessor for reasoning radix-cache stripping, #23315 is the merged opt-in thinking-token strip from radix cache, and #23336 is the open spec-v2 adaptive-spec follow-up.
## Evolution Path

### Stage V32-0: Bring up DSA/NSA as a separate DeepSeek class
Key PR:
Success check:
- `DeepseekV32ForCausalLM` exists
- `is_deepseek_nsa(config)` is true
- `server_args.py` selects `attention_backend = "nsa"`
- `NativeSparseAttnBackend` and `Indexer` are active
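A minimal way to check these items against a local checkpoint, assuming `is_deepseek_nsa` is importable from `sglang.srt.utils` (the helper exists in the tree, but its exact module may differ on your commit):

```python
# Hedged sketch: confirm DSA/NSA detection before touching anything else.
# The import location of is_deepseek_nsa is an assumption; find the helper in
# your checkout if it has moved.
from transformers import AutoConfig

from sglang.srt.utils import is_deepseek_nsa  # assumed location

cfg = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V3.2-Exp", trust_remote_code=True
)
print(cfg.architectures)     # expect ["DeepseekV32ForCausalLM"]
print(is_deepseek_nsa(cfg))  # expect True for a DSA/NSA config
```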
### Stage V32-1: Server defaults, KV cache dtype, and backend pair
V3.2 has model-specific defaults:
- DSA KV cache defaults to `fp8_e4m3` on SM100 and `bfloat16` otherwise
- only `bfloat16` and `fp8_e4m3` are mainline DSA KV cache dtypes
- ROCm defaults to TileLang NSA backends
- Blackwell defaults now prefer TRTLLM NSA kernels
- Hopper often uses `flashmla_sparse`, `flashmla_kv`, or `fa3`
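A sketch of pinning the backend pair and KV cache dtype explicitly instead of relying on auto-selection; the NSA flags and values come from this skill, while the checkpoint name and TP size are placeholders, and accepted values should be checked against `server_args.py` on your commit:

```python
# Hedged launch sketch: fix the NSA backend pair and KV cache dtype so later
# comparisons are not confounded by model-specific defaults. Values shown are
# one Hopper-style example, not a recommendation.
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V3.2-Exp",
    "--tp", "8",
    "--attention-backend", "nsa",
    "--nsa-prefill-backend", "flashmla_sparse",
    "--nsa-decode-backend", "flashmla_kv",
    "--kv-cache-dtype", "bfloat16",
]
subprocess.run(cmd, check=True)
```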
Key PRs:
### Stage V32-2: Indexer correctness and performance
The NSA indexer computes sparse indices through q/k projection, weights projection, top-k, transforms, and optional KV-cache store.
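A shape-level sketch of that flow with a single fused indexer head; the projections, the real `weights_proj` semantics, and the fused kernels are all omitted, so treat it as an illustration of what the top-k output looks like rather than the SGLang implementation:

```python
# Conceptual sketch of DSA/NSA index selection: score every cached key with
# indexer-space projections, weight the scores, and keep top-k positions per
# query token. Shapes and the scoring function are illustrative only.
import torch


def select_sparse_indices(
    q_idx: torch.Tensor,    # [num_tokens, idx_dim] indexer-space queries
    k_idx: torch.Tensor,    # [seq_len, idx_dim] indexer-space keys
    weights: torch.Tensor,  # [num_tokens] stand-in for weights_proj output
    topk: int,
) -> torch.Tensor:
    scores = (q_idx @ k_idx.T) * weights[:, None]  # [num_tokens, seq_len]
    k = min(topk, k_idx.shape[0])
    # The positions returned here are what the sparse MLA backend consumes.
    return torch.topk(scores, k=k, dim=-1).indices


q = torch.randn(4, 128)
kc = torch.randn(1024, 128)
w = torch.ones(4)
print(select_sparse_indices(q, kc, w, topk=256).shape)  # torch.Size([4, 256])
```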
Key PRs:
Success check:
- `weights_proj` avoids FP32 precision loss
- K/S buffers use fused kernels where available
- FP8 KV cache store is fused or padded correctly for the selected backend
- AMD and NPU have separate indexer paths where needed
### Stage V32-3: Context parallel, PP, and DP attention
Context parallel for NSA is powerful but constrained.
Key PRs:
Success check:
- `round-robin-split` is the current default CP token split method
- `in-seq-split` requires DeepEP and `ep_size == tp_size`
- CP in PD decode mode is asserted away
- CP positions match EAGLE NextN
- key all-gather can overlap query computation
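A small sketch that restates those constraints as assertions; the real validation lives in `server_args.py` and the NSA CP communicator, and the function and parameter names here are hypothetical:

```python
# Hedged sketch of the CP constraint surface described above; it only encodes
# the rules listed in this stage, not the actual SGLang argument validation.
def check_nsa_cp_args(
    cp_mode: str,
    enable_deepep: bool,
    ep_size: int,
    tp_size: int,
    pd_decode: bool,
) -> None:
    assert cp_mode in ("round-robin-split", "in-seq-split")
    if cp_mode == "in-seq-split":
        # in-seq-split is documented to require DeepEP with ep_size == tp_size
        assert enable_deepep and ep_size == tp_size
    # context parallel is asserted away in PD decode mode
    assert not pd_decode


check_nsa_cp_args("round-robin-split", False, 1, 8, pd_decode=False)
```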
### Stage V32-4: MTP and speculative decoding
V3.2 MTP must cooperate with NSA metadata, target verify, draft extend, and context parallel.
Key PRs:
### Stage V32-5: Quantized checkpoints and platform lanes
Separate the backend tracks:
- NVFP4 Blackwell: #17657, #18389, #20086
- AMD MXFP4/TileLang/FP8 KV: #17783, #19945, #20840, #21511, #22258, #22850
- NPU: #14541, #14572, #15381, #16990, #17007, #21468
- HiSparse/HiCache: #21259, #22065, #22425
### Stage V32-6: IndexCache
IndexCache reuses NSA top-k indices across layers.
Key PR:
Success check:
- `skip_topk` and `next_skip_topk` are set per layer
- `index_topk_freq` and `index_topk_pattern` override behavior correctly
- `prev_topk_indices` is carried through layers
- `test/registered/8-gpu-models/test_deepseek_v32_indexcache.py` remains accurate
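A loudly hedged sketch of the reuse idea only: the actual `index_topk_freq` and `index_topk_pattern` semantics live in the indexer and the IndexCache test, and the recompute rule below is an assumption made for illustration:

```python
# Hypothetical reading of cross-layer top-k reuse: recompute sparse indices on
# some layers and carry prev_topk_indices forward on the rest. This is NOT the
# SGLang contract; check the indexer and the IndexCache test for the real one.
def should_recompute_topk(layer_id: int, index_topk_freq: int) -> bool:
    # Assumption: a freq of N means "recompute every N layers".
    return index_topk_freq <= 1 or layer_id % index_topk_freq == 0


reuse_plan = [should_recompute_topk(i, index_topk_freq=4) for i in range(12)]
print(reuse_plan)  # True on layers 0, 4, 8 under this assumed rule
```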
### Stage V32-7: DSML tool calling and reasoning interaction
Standard V3.2 uses DSML:
```
<|DSML|function_calls><|DSML|invoke name="tool">...</|DSML|invoke></|DSML|function_calls>
```
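A minimal sketch of pulling one invoke block out of a DSML-formatted completion; the real detector in `deepseekv32_detector.py` also handles XML parameter tags, direct JSON arguments, and streaming partial input, none of which this toy regex covers:

```python
# Hedged sketch: extract tool name and raw argument payload from a DSML block.
# The tag layout follows the format shown above; the example tool is made up.
import re

text = (
    '<|DSML|function_calls><|DSML|invoke name="get_weather">'
    '{"city": "Hangzhou"}'
    '</|DSML|invoke></|DSML|function_calls>'
)

invoke_re = re.compile(
    r'<\|DSML\|invoke name="(?P<name>[^"]+)">(?P<body>.*?)</\|DSML\|invoke>',
    re.DOTALL,
)
for m in invoke_re.finditer(text):
    print(m.group("name"), m.group("body"))
```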
The detector supports XML parameter tags and direct JSON. Track open parser bugs:
- #21179: reasoning parser should preserve V3.2 tool-call markers.
- #21546: catch malformed JSON while parsing partial function calls.
## Validation Surface
Use the narrowest lane that matches the change:
- V3.2 base/MTP/DP/TP/tool-calling: `test/registered/8-gpu-models/test_deepseek_v32.py`
- NSA backend pair: `test_deepseek_v32_nsa_backends` inside that file
- IndexCache: `test/registered/8-gpu-models/test_deepseek_v32_indexcache.py`
- chat template argument types: `test/manual/test_deepseek_chat_templates.py`
- CP and DeepEP-specific changes: use the dedicated CP/DeepEP suites referenced by the PR
- AMD changes: MI300/MI355 registered lanes
- NPU changes: Ascend/NPU model deployment and backend tests
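A sketch of driving the narrowest lane from a checkout, assuming the registered file runs under plain pytest on a suitable 8-GPU machine; CI may use its own launcher, and the `-k` filter is derived from the suite name above:

```python
# Hedged sketch: run only the NSA backend-pair tests from the V3.2 lane.
# Adjust the file and filter to the narrowest suite that matches the change.
import subprocess

subprocess.run(
    [
        "python", "-m", "pytest",
        "test/registered/8-gpu-models/test_deepseek_v32.py",
        "-k", "nsa_backends",
        "-x", "-q",
    ],
    check=True,
)
```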