sglang-deepseek-v32-optimization
# SGLang DeepSeek V3.2 Optimization

## Overview
This skill covers the DeepSeek V3.2 support and optimization ladder active in SGLang main. V3.2 shares the DeepSeek V3/R1 model backbone, but it is a separate optimization problem because it activates DeepSeek Sparse Attention, called DSA in docs and NSA in SGLang code.
Current-main snapshot:
- SGLang `origin/main`: `929e00eea` on 2026-04-21
- sgl-cookbook `origin/main`: `8ec4d03` on 2026-04-21
- V3.2 runtime entry: `DeepseekV32ForCausalLM` in `python/sglang/srt/models/deepseek_v2.py`
- NSA backend: `python/sglang/srt/layers/attention/nsa_backend.py`
- NSA indexer: `python/sglang/srt/layers/attention/nsa/nsa_indexer.py`
- V3.2 tool parser: `python/sglang/srt/function_call/deepseekv32_detector.py`
The historical evidence lives in:
- `references/pr-history.md`: chronological PR evidence and code-level notes
- `references/playbook.md`: investigation order, symptom mapping, validation commands
## Before You Change Anything
Record the exact serving shape first:
- model: V3.2-Exp, V3.2, V3.2-Speciale, V3.2-NVFP4, or V3.2-MXFP4
- whether `is_deepseek_nsa(config)` is true
- `--attention-backend`, `--nsa-prefill-backend`, `--nsa-decode-backend`
- KV cache dtype: `auto`, `bfloat16`, `fp8_e4m3`, or experimental FP4 tracks
- TP / DP / EP / PP / PD topology
- `--enable-dp-attention`
- `--enable-nsa-prefill-context-parallel`
- `--nsa-prefill-cp-mode`: `round-robin-split` or `in-seq-split`
- MTP enabled or not
- IndexCache knobs: `index_topk_freq`, `index_topk_pattern`
- tool parser: V3.2-Exp may use `deepseekv31` in the cookbook path, standard V3.2 uses `deepseekv32`
- reasoning parser: `--reasoning-parser deepseek-v3`
- hardware: H200, B200/GB200/GB300, AMD MI300/MI355, NPU, or another backend
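A minimal sketch of one recorded shape; the keys mirror the checklist above, and every value is an example for a single hypothetical deployment, not a default:

```python
# Example serving-shape record for a hypothetical H200 TP-8 deployment.
# Keys follow the checklist above; values are illustrative only.
serving_shape = {
    "model": "deepseek-ai/DeepSeek-V3.2-Exp",
    "is_deepseek_nsa": True,
    "attention_backend": "nsa",
    "nsa_prefill_backend": "flashmla_sparse",
    "nsa_decode_backend": "flashmla_kv",
    "kv_cache_dtype": "bfloat16",
    "topology": {"tp": 8, "dp": 1, "ep": 1, "pp": 1, "pd_disaggregation": False},
    "enable_dp_attention": False,
    "enable_nsa_prefill_context_parallel": False,
    "nsa_prefill_cp_mode": "round-robin-split",
    "mtp_enabled": False,
    "index_topk_freq": None,
    "index_topk_pattern": None,
    "tool_call_parser": "deepseekv32",
    "reasoning_parser": "deepseek-v3",
    "hardware": "H200",
}
print(serving_shape)
```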
## Core Principle
Do not treat V3.2 as ordinary DeepSeek V3.
- V3.2 turns on DSA/NSA through `is_deepseek_nsa(config)`.
- The attention hot path is split between the indexer, top-k transform, sparse MLA backend, and KV-cache quant/dequant.
- Server defaults are model-specific: attention backend becomes `nsa`, KV cache dtype defaults differ by architecture, and NSA prefill/decode backends are auto-selected.
- Context parallel is experimental and has strict mode-specific constraints.
- MTP spans the NextN layer, NSA metadata, target_verify, draft_extend, CP positions, and speculative overlap.
- V3.2 parser behavior is DSML for standard V3.2, while V3.2-Exp docs still point at the V3.1-style parser path.
The optimization order matters:
1. confirm DSA detection and server defaults
2. confirm KV cache dtype and NSA backend pair
3. validate indexer top-k generation and transform
4. validate MTP, CP, PP, or DP attention only after base DSA is correct
5. tune backend-specific kernels for Blackwell, Hopper, AMD, or NPU
6. add model-backed tests for any IndexCache, MTP, CP, or backend change
## Main Runtime Surfaces
Start from these files before changing behavior:
- `python/sglang/srt/models/deepseek_v2.py`
- `python/sglang/srt/models/deepseek_nextn.py`
- `python/sglang/srt/configs/model_config.py`
- `python/sglang/srt/server_args.py`
- `python/sglang/srt/managers/schedule_batch.py`
- `python/sglang/srt/managers/scheduler_output_processor_mixin.py`
- `python/sglang/srt/mem_cache/common.py`
- `python/sglang/srt/speculative/eagle_worker_v2.py`
- `python/sglang/srt/speculative/multi_layer_eagle_worker_v2.py`
- `python/sglang/srt/layers/attention/nsa_backend.py`
- `python/sglang/srt/layers/attention/nsa/nsa_indexer.py`
- `python/sglang/srt/layers/attention/nsa/utils.py`
- `python/sglang/srt/layers/attention/nsa/transform_index.py`
- `python/sglang/srt/layers/attention/nsa/quant_k_cache.py`
- `python/sglang/srt/layers/attention/nsa/dequant_k_cache.py`
- `python/sglang/srt/layers/communicator_nsa_cp.py`
- `python/sglang/srt/function_call/deepseekv32_detector.py`
- `examples/chat_template/tool_chat_template_deepseekv32.jinja`
## Open PRs to Track
Check these before declaring a V3.2 gap:
- #11191: sparse attention and CPU/GPU KV scheduling for GQA/DSA, open.
- #12820: TP-SP for Qwen and DeepSeek V2/V3/V3.2, open.
- #16148: V3.2 W4AFP8 MTP with FP8 draft model, open.
- #17185: TP `o_proj` linear in context-parallel NSA, open.
- #17761: missing Assistant token after V3.1/V3.2 tool output, open.
- #18167: DCP support for V3.2, open.
- #18275: NPU all-gather after qlora for V3.2, open.
- #18733: V3.2 PD disaggregation test, open.
- #19211: extract V3.2/NSA logic into `DeepseekV32Mixin`, open.
- #19299: O(1) expert weight matching in DeepSeek weight loader, open.
- #19609: TP indexer weight in NSA attention, open.
- #19975: AMD context parallel for V3.2, open.
- #20360: AMD CP round-robin split garbage output, open.
- #20531: NSA indexer ragged gather mismatch in CP round-robin split, open.
- #20809: add `DeepseekV32ForCausalLM` to MTP draft mapping, open.
- #20880: reject HiCache L3 for NSA models, open.
- #21179: preserve V3.2 tool-call markers in reasoning parsing, open.
- #21194: AMD `PPMissingLayer` fix in DeepSeek AITER gfx95 path, open.
- #21506: V3.2 NPU torch compile, open.
- #21529: ROCm MXFP4 / Quark W4A4 support for DeepSeek architecture, open.
- #21530: ROCm fused MLA decode RoPE fix for DeepSeek-variant models, open.
- #21546: catch malformed JSON in V3.2 partial function-call parsing, open.
- #21889: AMD FP4 KV cache quantization for NSA TileLang, open.
- #22268: DeepSeek MLA LoRA adapter bypass in `prepare_qkv_latent`, open.
- #22473: dense MLA decode fallback for short sequences, open.
- #22774: MUSA backend support for DeepSeek V2/V3/R1-class layers, open.
- #22851: `--nsa-topk-backend` and FlashInfer/PyTorch top-k, open.
- #22865: sparsity framework extension for non-NSA sparse algorithms, open.
- #14332: V3.2 tool-call parsing without DSML tag, open.
- #14524: NSA backend test suite, open.
- #15322: V3.2 `o_proj` TP support, open.
- #18094: V3.2 piecewise CUDA graph, open and related to #23351.
- #18542: EAGLE3 plus NSA CP aux-hidden-state index bug, open.
- #19987: AMD FP8 KV cache for TileLang NSA backend, open.
- #20534: transfer FP8 K/K-scale for CP indexer prefill gather, open.
- #21623: unit tests for `encoding_dsv32.py`, open.
- #22792: AITER `indexer_k_quant_and_cache`, open.
- #23268: NPU accuracy fix for NSA CP plus prefix cache, open.
- #22938: restore DeepSeek MLA MI300X paths after the MLA refactor, open.
- #23195: guard `.weight` access in DeepSeek MLA for AWQ/compressed-tensors, open.
- #23241: 3FS backend for DSA/mamba, open.
- #23257: CuteDSL EP plus DP-attention double-reduce fix in `DeepseekV2MoE`, open.
- #23336: adaptive speculative-num-steps support for spec v2 EAGLE workers, open.
- #23351: piecewise CUDA graph with NSA, open.
## Additional PR Coverage
Additional PR coverage across all states (open, merged, and closed) includes V3.2 bugfixes, closed experiments, tool-parser updates, and platform-specific backend work:
- Early bring-up polish: #11063, #11194, #11308, #11309, #11450, #11557, #11565, #11682, #11815, and #11835.
- Short-sequence MHA / Indexer fixes: #11892, #12094, #12582, #12583, #12645, #12788, #12816, #12964, #13022, #13459, and #13544.
- DSML/tool/parser path: #14304, #14307, #14353, #14573, #14750, #15064, #15278, #16091, #18126, #18174, and #17951.
- NSA backend / metadata / sparse-cache work: #14781, #14901, #15040, #15086, #15242, #15429, #16520, #16758, #16841, #17205, #17554, and #18319.
- HiSparse/HiCache and platform fixes: #14741, #17409, #17518, #17523, #17633, #18297, #18526, #20343, #21932, and #22238.
- Closed or superseded experiments to cite as history, not current support: #11109, #11596, #11761, #12017, #12052, #13531, #13546, #14619, #14904, #15051, #15217, #15310, #15807, #16079, #16881, #17024, #17199, #17310, and #17647.
- Round-2 runtime additions: #21249 adds all-reduce fusion with context parallel, #22003 relaxes `moe_dp_size == 1` with different `attention_cp_size` values, #21599 adds adaptive EAGLE top-k=1 draft steps, #22128 allows PCG with speculative decoding, #23219 touches shared DSA/NextN infrastructure through `deepseek_nextn.py`, #22950 is the closed predecessor for reasoning radix-cache stripping, #23315 is the merged opt-in thinking-token strip from radix cache, and #23336 is the open spec-v2 adaptive-spec follow-up.
## Evolution Path

### Stage V32-0: Bring up DSA/NSA as a separate DeepSeek class
Key PR:
Success check:
- `DeepseekV32ForCausalLM` exists
- `is_deepseek_nsa(config)` is true
- `server_args.py` selects `attention_backend = "nsa"`
- `NativeSparseAttnBackend` and `Indexer` are active
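A minimal way to check these items against a local checkpoint, assuming `is_deepseek_nsa` is importable from `sglang.srt.utils` (the helper exists in the tree, but its exact module may differ on your commit):

```python
# Hedged sketch: confirm DSA/NSA detection before touching anything else.
# The import location of is_deepseek_nsa is an assumption; find the helper in
# your checkout if it has moved.
from transformers import AutoConfig

from sglang.srt.utils import is_deepseek_nsa  # assumed location

cfg = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V3.2-Exp", trust_remote_code=True
)
print(cfg.architectures)     # expect ["DeepseekV32ForCausalLM"]
print(is_deepseek_nsa(cfg))  # expect True for a DSA/NSA config
```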
### Stage V32-1: Server defaults, KV cache dtype, and backend pair
V3.2 has model-specific defaults:
- DSA KV cache defaults to `fp8_e4m3` on SM100 and `bfloat16` otherwise
- only `bfloat16` and `fp8_e4m3` are mainline DSA KV cache dtypes
- ROCm defaults to TileLang NSA backends
- Blackwell defaults now prefer TRTLLM NSA kernels
- Hopper often uses `flashmla_sparse`, `flashmla_kv`, or `fa3`
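A sketch of pinning the backend pair and KV cache dtype explicitly instead of relying on auto-selection; the NSA flags and values come from this skill, while the checkpoint name and TP size are placeholders, and accepted values should be checked against `server_args.py` on your commit:

```python
# Hedged launch sketch: fix the NSA backend pair and KV cache dtype so later
# comparisons are not confounded by model-specific defaults. Values shown are
# one Hopper-style example, not a recommendation.
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V3.2-Exp",
    "--tp", "8",
    "--attention-backend", "nsa",
    "--nsa-prefill-backend", "flashmla_sparse",
    "--nsa-decode-backend", "flashmla_kv",
    "--kv-cache-dtype", "bfloat16",
]
subprocess.run(cmd, check=True)
```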
Key PRs:
### Stage V32-2: Indexer correctness and performance
The NSA indexer computes sparse indices through q/k projection, weights projection, top-k, transforms, and optional KV-cache store.
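A shape-level sketch of that flow with a single fused indexer head; the projections, the real `weights_proj` semantics, and the fused kernels are all omitted, so treat it as an illustration of what the top-k output looks like rather than the SGLang implementation:

```python
# Conceptual sketch of DSA/NSA index selection: score every cached key with
# indexer-space projections, weight the scores, and keep top-k positions per
# query token. Shapes and the scoring function are illustrative only.
import torch


def select_sparse_indices(
    q_idx: torch.Tensor,    # [num_tokens, idx_dim] indexer-space queries
    k_idx: torch.Tensor,    # [seq_len, idx_dim] indexer-space keys
    weights: torch.Tensor,  # [num_tokens] stand-in for weights_proj output
    topk: int,
) -> torch.Tensor:
    scores = (q_idx @ k_idx.T) * weights[:, None]  # [num_tokens, seq_len]
    k = min(topk, k_idx.shape[0])
    # The positions returned here are what the sparse MLA backend consumes.
    return torch.topk(scores, k=k, dim=-1).indices


q = torch.randn(4, 128)
kc = torch.randn(1024, 128)
w = torch.ones(4)
print(select_sparse_indices(q, kc, w, topk=256).shape)  # torch.Size([4, 256])
```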
Key PRs:
Success check:
- `weights_proj` avoids FP32 precision loss
- K/S buffers use fused kernels where available
- FP8 KV cache store is fused or padded correctly for the selected backend
- AMD and NPU have separate indexer paths where needed
### Stage V32-3: Context parallel, PP, and DP attention
Context parallel for NSA is powerful but constrained.
Key PRs:
Success check:
- `round-robin-split` is the current default CP token split method
- `in-seq-split` requires DeepEP and `ep_size == tp_size`
- CP in PD decode mode is asserted away
- CP positions match EAGLE NextN
- key all-gather can overlap query computation
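A small sketch that restates those constraints as assertions; the real validation lives in `server_args.py` and the NSA CP communicator, and the function and parameter names here are hypothetical:

```python
# Hedged sketch of the CP constraint surface described above; it only encodes
# the rules listed in this stage, not the actual SGLang argument validation.
def check_nsa_cp_args(
    cp_mode: str,
    enable_deepep: bool,
    ep_size: int,
    tp_size: int,
    pd_decode: bool,
) -> None:
    assert cp_mode in ("round-robin-split", "in-seq-split")
    if cp_mode == "in-seq-split":
        # in-seq-split is documented to require DeepEP with ep_size == tp_size
        assert enable_deepep and ep_size == tp_size
    # context parallel is asserted away in PD decode mode
    assert not pd_decode


check_nsa_cp_args("round-robin-split", False, 1, 8, pd_decode=False)
```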
### Stage V32-4: MTP and speculative decoding
V3.2 MTP must cooperate with NSA metadata, target verify, draft extend, and context parallel.
Key PRs:
### Stage V32-5: Quantized checkpoints and platform lanes
Separate the backend tracks:
- NVFP4 Blackwell: #17657, #18389, #20086
- AMD MXFP4/TileLang/FP8 KV: #17783, #19945, #20840, #21511, #22258, #22850
- NPU: #14541, #14572, #15381, #16990, #17007, #21468
- HiSparse/HiCache: #21259, #22065, #22425
### Stage V32-6: IndexCache
IndexCache reuses NSA top-k indices across layers.
Key PR:
Success check:
- `skip_topk` and `next_skip_topk` are set per layer
- `index_topk_freq` and `index_topk_pattern` override behavior correctly
- `prev_topk_indices` is carried through layers
- `test/registered/8-gpu-models/test_deepseek_v32_indexcache.py` remains accurate
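A loudly hedged sketch of the reuse idea only: the actual `index_topk_freq` and `index_topk_pattern` semantics live in the indexer and the IndexCache test, and the recompute rule below is an assumption made for illustration:

```python
# Hypothetical reading of cross-layer top-k reuse: recompute sparse indices on
# some layers and carry prev_topk_indices forward on the rest. This is NOT the
# SGLang contract; check the indexer and the IndexCache test for the real one.
def should_recompute_topk(layer_id: int, index_topk_freq: int) -> bool:
    # Assumption: a freq of N means "recompute every N layers".
    return index_topk_freq <= 1 or layer_id % index_topk_freq == 0


reuse_plan = [should_recompute_topk(i, index_topk_freq=4) for i in range(12)]
print(reuse_plan)  # True on layers 0, 4, 8 under this assumed rule
```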
### Stage V32-7: DSML tool calling and reasoning interaction
Standard V3.2 uses DSML:
```
<|DSML|function_calls><|DSML|invoke name="tool">...</|DSML|invoke></|DSML|function_calls>
```
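A minimal sketch of pulling one invoke block out of a DSML-formatted completion; the real detector in `deepseekv32_detector.py` also handles XML parameter tags, direct JSON arguments, and streaming partial input, none of which this toy regex covers:

```python
# Hedged sketch: extract tool name and raw argument payload from a DSML block.
# The tag layout follows the format shown above; the example tool is made up.
import re

text = (
    '<|DSML|function_calls><|DSML|invoke name="get_weather">'
    '{"city": "Hangzhou"}'
    '</|DSML|invoke></|DSML|function_calls>'
)

invoke_re = re.compile(
    r'<\|DSML\|invoke name="(?P<name>[^"]+)">(?P<body>.*?)</\|DSML\|invoke>',
    re.DOTALL,
)
for m in invoke_re.finditer(text):
    print(m.group("name"), m.group("body"))
```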
The detector supports XML parameter tags and direct JSON. Track open parser bugs:
- #21179: reasoning parser should preserve V3.2 tool-call markers.
- #21546: catch malformed JSON while parsing partial function calls.
## Validation Surface
Use the narrowest lane that matches the change:
- V3.2 base/MTP/DP/TP/tool-calling: `test/registered/8-gpu-models/test_deepseek_v32.py`
- NSA backend pair: `test_deepseek_v32_nsa_backends` inside that file
- IndexCache: `test/registered/8-gpu-models/test_deepseek_v32_indexcache.py`
- chat template argument types: `test/manual/test_deepseek_chat_templates.py`
- CP and DeepEP-specific changes: use the dedicated CP/DeepEP suites referenced by the PR
- AMD changes: MI300/MI355 registered lanes
- NPU changes: Ascend/NPU model deployment and backend tests
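A sketch of driving the narrowest lane from a checkout, assuming the registered file runs under plain pytest on a suitable 8-GPU machine; CI may use its own launcher, and the `-k` filter is derived from the suite name above:

```python
# Hedged sketch: run only the NSA backend-pair tests from the V3.2 lane.
# Adjust the file and filter to the narrowest suite that matches the change.
import subprocess

subprocess.run(
    [
        "python", "-m", "pytest",
        "test/registered/8-gpu-models/test_deepseek_v32.py",
        "-k", "nsa_backends",
        "-x", "-q",
    ],
    check=True,
)
```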