SGLang DeepSeek V3.2 Optimization

Overview

This skill covers DeepSeek V3.2 support and the optimization ladder currently active in SGLang main. V3.2 shares the DeepSeek V3/R1 model backbone, but it is a separate optimization problem because it activates DeepSeek Sparse Attention (called DSA in the docs and NSA in the SGLang code).

Current-main snapshot:

  • SGLang origin/main: 929e00eea on 2026-04-21
  • sgl-cookbook origin/main: 8ec4d03 on 2026-04-21
  • V3.2 runtime entry: DeepseekV32ForCausalLM in python/sglang/srt/models/deepseek_v2.py
  • NSA backend: python/sglang/srt/layers/attention/nsa_backend.py
  • NSA indexer: python/sglang/srt/layers/attention/nsa/nsa_indexer.py
  • V3.2 tool parser: python/sglang/srt/function_call/deepseekv32_detector.py

The historical evidence lives in the PR lists below.

Before You Change Anything

Record the exact serving shape first:

  • model: V3.2-Exp, V3.2, V3.2-Speciale, V3.2-NVFP4, or V3.2-MXFP4
  • whether is_deepseek_nsa(config) is true
  • --attention-backend, --nsa-prefill-backend, --nsa-decode-backend
  • KV cache dtype: auto, bfloat16, fp8_e4m3, or experimental FP4 tracks
  • TP / DP / EP / PP / PD topology
  • --enable-dp-attention
  • --enable-nsa-prefill-context-parallel
  • --nsa-prefill-cp-mode: round-robin-split or in-seq-split
  • MTP enabled or not
  • IndexCache knobs: index_topk_freq, index_topk_pattern
  • tool parser: V3.2-Exp may use deepseekv31 in the cookbook path; standard V3.2 uses deepseekv32
  • reasoning parser: --reasoning-parser deepseek-v3
  • hardware: H200, B200/GB200/GB300, AMD MI300/MI355, NPU, or another backend
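
The checklist above can be made concrete as a small snapshot object. This is an illustrative sketch only: the field names mirror the server flags listed here, and ServingShape is a hypothetical helper, not an SGLang class.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical snapshot of the serving shape; field names mirror the
# checklist above, not any actual SGLang API.
@dataclass
class ServingShape:
    model: str
    is_nsa: bool
    attention_backend: str = "nsa"
    kv_cache_dtype: str = "auto"
    enable_dp_attention: bool = False
    enable_nsa_prefill_context_parallel: bool = False
    nsa_prefill_cp_mode: Optional[str] = None
    mtp_enabled: bool = False

    def validate(self) -> None:
        # --nsa-prefill-cp-mode only makes sense when CP is enabled.
        if self.nsa_prefill_cp_mode is not None:
            assert self.enable_nsa_prefill_context_parallel
            assert self.nsa_prefill_cp_mode in ("round-robin-split", "in-seq-split")

shape = ServingShape(
    model="V3.2",
    is_nsa=True,
    enable_nsa_prefill_context_parallel=True,
    nsa_prefill_cp_mode="round-robin-split",
)
shape.validate()
```

Recording this shape up front makes it much easier to reproduce a bug report or compare two runs knob by knob.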

Core Principle

Do not treat V3.2 as ordinary DeepSeek V3.

  • V3.2 turns on DSA/NSA through is_deepseek_nsa(config).
  • The attention hot path is split between the indexer, top-k transform, sparse MLA backend, and KV-cache quant/dequant.
  • Server defaults are model-specific: attention backend becomes nsa, KV cache dtype defaults differ by architecture, and NSA prefill/decode backends are auto-selected.
  • Context parallel is experimental and has strict mode-specific constraints.
  • MTP spans the NextN layer, NSA metadata, target_verify, draft_extend, CP positions, and speculative overlap.
  • V3.2 parser behavior is DSML for standard V3.2, while V3.2-Exp docs still point at the V3.1-style parser path.

The optimization order matters:

  1. confirm DSA detection and server defaults
  2. confirm KV cache dtype and NSA backend pair
  3. validate indexer top-k generation and transform
  4. validate MTP, CP, PP, or DP attention only after base DSA is correct
  5. then tune backend-specific kernels for Blackwell, Hopper, AMD, or NPU
  6. add model-backed tests for any IndexCache, MTP, CP, or backend change

Main Runtime Surfaces

Start from these files before changing behavior:

  • python/sglang/srt/models/deepseek_v2.py
  • python/sglang/srt/models/deepseek_nextn.py
  • python/sglang/srt/configs/model_config.py
  • python/sglang/srt/server_args.py
  • python/sglang/srt/managers/schedule_batch.py
  • python/sglang/srt/managers/scheduler_output_processor_mixin.py
  • python/sglang/srt/mem_cache/common.py
  • python/sglang/srt/speculative/eagle_worker_v2.py
  • python/sglang/srt/speculative/multi_layer_eagle_worker_v2.py
  • python/sglang/srt/layers/attention/nsa_backend.py
  • python/sglang/srt/layers/attention/nsa/nsa_indexer.py
  • python/sglang/srt/layers/attention/nsa/utils.py
  • python/sglang/srt/layers/attention/nsa/transform_index.py
  • python/sglang/srt/layers/attention/nsa/quant_k_cache.py
  • python/sglang/srt/layers/attention/nsa/dequant_k_cache.py
  • python/sglang/srt/layers/communicator_nsa_cp.py
  • python/sglang/srt/function_call/deepseekv32_detector.py
  • examples/chat_template/tool_chat_template_deepseekv32.jinja

Open PRs to Track

Check these before declaring a V3.2 gap:

  • #11191: sparse attention and CPU/GPU KV scheduling for GQA/DSA, open.
  • #12820: TP-SP for Qwen and DeepSeek V2/V3/V3.2, open.
  • #16148: V3.2 W4AFP8 MTP with FP8 draft model, open.
  • #17185: TP o_proj linear in context-parallel NSA, open.
  • #17761: missing Assistant token after V3.1/V3.2 tool output, open.
  • #18167: DCP support for V3.2, open.
  • #18275: NPU all-gather after qlora for V3.2, open.
  • #18733: V3.2 PD disaggregation test, open.
  • #19211: extract V3.2/NSA logic into DeepseekV32Mixin, open.
  • #19299: O(1) expert weight matching in DeepSeek weight loader, open.
  • #19609: TP indexer weight in NSA attention, open.
  • #19975: AMD context parallel for V3.2, open.
  • #20360: AMD CP round-robin split garbage output, open.
  • #20531: NSA indexer ragged gather mismatch in CP round-robin split, open.
  • #20809: add DeepseekV32ForCausalLM to MTP draft mapping, open.
  • #20880: reject HiCache L3 for NSA models, open.
  • #21179: preserve V3.2 tool-call markers in reasoning parsing, open.
  • #21194: AMD PPMissingLayer fix in DeepSeek AITER gfx95 path, open.
  • #21506: V3.2 NPU torch compile, open.
  • #21529: ROCm MXFP4 / Quark W4A4 support for DeepSeek architecture, open.
  • #21530: ROCm fused MLA decode RoPE fix for DeepSeek-variant models, open.
  • #21546: catch malformed JSON in V3.2 partial function-call parsing, open.
  • #21889: AMD FP4 KV cache quantization for NSA TileLang, open.
  • #22268: DeepSeek MLA LoRA adapter bypass in prepare_qkv_latent, open.
  • #22473: dense MLA decode fallback for short sequences, open.
  • #22774: MUSA backend support for DeepSeek V2/V3/R1-class layers, open.
  • #22851: --nsa-topk-backend and FlashInfer/PyTorch top-k, open.
  • #22865: sparsity framework extension for non-NSA sparse algorithms, open.
  • #14332: V3.2 tool-call parsing without DSML tag, open.
  • #14524: NSA backend test suite, open.
  • #15322: V3.2 o_proj TP support, open.
  • #18094: V3.2 piecewise CUDA graph, open and related to #23351.
  • #18542: EAGLE3 plus NSA CP aux-hidden-state index bug, open.
  • #19987: AMD FP8 KV cache for TileLang NSA backend, open.
  • #20534: transfer FP8 K/K-scale for CP indexer prefill gather, open.
  • #21623: unit tests for encoding_dsv32.py, open.
  • #22792: AITER indexer_k_quant_and_cache, open.
  • #23268: NPU accuracy fix for NSA CP plus prefix cache, open.
  • #22938: restore DeepSeek MLA MI300X paths after the MLA refactor, open.
  • #23195: guard .weight access in DeepSeek MLA for AWQ/compressed-tensors, open.
  • #23241: 3FS backend for DSA/mamba, open.
  • #23257: CuteDSL EP plus DP-attention double-reduce fix in DeepseekV2MoE, open.
  • #23336: adaptive speculative-num-steps support for spec v2 EAGLE workers, open.
  • #23351: piecewise CUDA graph with NSA, open.

Additional PR Coverage

Beyond the open PRs above, PR coverage across all states (open, closed, and merged) includes V3.2 bugfixes, closed experiments, tool-parser updates, and platform-specific backend work:

Evolution Path

Stage V32-0: Bring up DSA/NSA as a separate DeepSeek class

Key PR:

Success check:

  • DeepseekV32ForCausalLM exists
  • is_deepseek_nsa(config) is true
  • server_args.py selects attention_backend = "nsa"
  • NativeSparseAttnBackend and Indexer are active
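
As a sanity sketch, DSA/NSA detection keys off the indexer hyperparameters in the model config. The stand-in below assumes the attribute names from the public V3.2 HF config (index_n_heads, index_head_dim, index_topk); the real helper lives in python/sglang/srt/layers/attention/nsa/utils.py.

```python
from types import SimpleNamespace

def is_deepseek_nsa_like(config) -> bool:
    # Toy stand-in for is_deepseek_nsa(config): a model counts as DSA/NSA
    # when the config carries the indexer hyperparameters. Attribute names
    # are an assumption based on the public V3.2 config and may drift.
    return all(
        getattr(config, attr, None) is not None
        for attr in ("index_n_heads", "index_head_dim", "index_topk")
    )

v32 = SimpleNamespace(index_n_heads=64, index_head_dim=128, index_topk=2048)
v3 = SimpleNamespace()
print(is_deepseek_nsa_like(v32), is_deepseek_nsa_like(v3))  # True False
```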

Stage V32-1: Server defaults, KV cache dtype, and backend pair

V3.2 has model-specific defaults:

  • DSA KV cache defaults to fp8_e4m3 on SM100 and bfloat16 otherwise
  • only bfloat16 and fp8_e4m3 are mainline DSA KV cache dtypes
  • ROCm defaults to TileLang NSA backends
  • Blackwell defaults now prefer TRTLLM NSA kernels
  • Hopper often uses flashmla_sparse, flashmla_kv, or fa3
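
The KV-cache dtype rule above can be sketched as a one-liner. The real auto-selection lives in server_args.py; encoding the hardware as an integer SM version is an assumption of this sketch.

```python
def default_dsa_kv_cache_dtype(sm_version: int) -> str:
    # Mirrors the stated rule: fp8_e4m3 on SM100 (Blackwell-class),
    # bfloat16 on everything else. Actual logic lives in server_args.py.
    return "fp8_e4m3" if sm_version == 100 else "bfloat16"

print(default_dsa_kv_cache_dtype(100))  # fp8_e4m3
print(default_dsa_kv_cache_dtype(90))   # bfloat16
```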

Key PRs:

Stage V32-2: Indexer correctness and performance

The NSA indexer computes sparse indices through q/k projection, weights projection, top-k, transforms, and optional KV-cache store.
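
A dependency-free toy of the scoring-and-top-k step looks like this. The real indexer in nsa_indexer.py runs fused FP8 kernels over projected q/k; the names and plain dot-product scoring here are illustrative only.

```python
import random

def indexer_scores(q, keys):
    # Toy relevance score: dot product of a projected query against each
    # cached key. Stand-in for the real FP8 indexer GEMM.
    return [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]

def topk_indices(scores, k):
    # Deterministic top-k: highest score first, position breaks ties.
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

random.seed(0)
q = [random.random() for _ in range(8)]
keys = [[random.random() for _ in range(8)] for _ in range(16)]
print(topk_indices(indexer_scores(q, keys), k=4))
```

The downstream transform step (transform_index.py) then converts these per-token indices into the layout the sparse MLA backend expects.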

Key PRs:

Success check:

  • weights_proj avoids FP32 precision loss
  • K/S buffers use fused kernels where available
  • FP8 KV cache store is fused or padded correctly for the selected backend
  • AMD and NPU have separate indexer paths where needed

Stage V32-3: Context parallel, PP, and DP attention

Context parallel for NSA is powerful but constrained.

Key PRs:

Success check:

  • round-robin-split is the current default CP token split method
  • in-seq-split requires DeepEP and ep_size == tp_size
  • CP in PD decode mode is asserted away
  • CP positions match EAGLE NextN
  • key all-gather can overlap query computation
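
The two CP token-split modes can be illustrated with a toy partition. This shows the split semantics only; the real partitioning and communication live in communicator_nsa_cp.py.

```python
def round_robin_split(tokens, cp_size, rank):
    # round-robin-split: token i lands on rank i % cp_size.
    return tokens[rank::cp_size]

def in_seq_split(tokens, cp_size, rank):
    # in-seq-split: contiguous per-rank chunks; per the constraints above,
    # this mode additionally requires DeepEP and ep_size == tp_size.
    chunk = (len(tokens) + cp_size - 1) // cp_size
    return tokens[rank * chunk:(rank + 1) * chunk]

tokens = list(range(8))
print(round_robin_split(tokens, 2, 0))  # [0, 2, 4, 6]
print(in_seq_split(tokens, 2, 1))      # [4, 5, 6, 7]
```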

Stage V32-4: MTP and speculative decoding

V3.2 MTP must cooperate with NSA metadata, target verify, draft extend, and context parallel.

Key PRs:

Stage V32-5: Quantized checkpoints and platform lanes

Separate the backend tracks:

Stage V32-6: IndexCache

IndexCache reuses NSA top-k indices across layers.

Key PR:

Success check:

  • skip_topk and next_skip_topk are set per layer
  • index_topk_freq and index_topk_pattern override behavior correctly
  • prev_topk_indices is carried through layers
  • test/registered/8-gpu-models/test_deepseek_v32_indexcache.py remains accurate
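
One plausible reading of index_topk_freq is a per-layer recompute/skip schedule. The sketch below is an interpretation of the checklist, not the IndexCache implementation; the authoritative semantics are in the IndexCache PR and test_deepseek_v32_indexcache.py.

```python
def topk_skip_schedule(num_layers: int, index_topk_freq: int):
    # Interpretation of index_topk_freq: recompute top-k every
    # `index_topk_freq` layers; on other layers set skip_topk and reuse
    # prev_topk_indices carried from the last recompute.
    return [layer % index_topk_freq != 0 for layer in range(num_layers)]

print(topk_skip_schedule(8, 4))
# [False, True, True, True, False, True, True, True]
```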

Stage V32-7: DSML tool calling and reasoning interaction

Standard V3.2 uses DSML:

<|DSML|function_calls><|DSML|invoke name="tool">...</|DSML|invoke></|DSML|function_calls>

The detector supports XML parameter tags and direct JSON. Track open parser bugs:

  • #21179: reasoning parser should preserve V3.2 tool-call markers.
  • #21546: catch malformed JSON while parsing partial function calls.
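
A minimal extraction of invoke names from the DSML markup shown above can be done with a regex. This is a toy: the real detector in deepseekv32_detector.py also handles XML parameter tags, direct JSON arguments, and streaming partial parses (see #21546).

```python
import re

# Matches the opening of a DSML invoke tag and captures the tool name.
INVOKE_RE = re.compile(r'<\|DSML\|invoke name="([^"]+)"')

def invoked_tools(text: str):
    return INVOKE_RE.findall(text)

msg = ('<|DSML|function_calls><|DSML|invoke name="tool">...'
       '</|DSML|invoke></|DSML|function_calls>')
print(invoked_tools(msg))  # ['tool']
```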

Validation Surface

Use the narrowest lane that matches the change:

  • V3.2 base/MTP/DP/TP/tool-calling: test/registered/8-gpu-models/test_deepseek_v32.py
  • NSA backend pair: test_deepseek_v32_nsa_backends inside that file
  • IndexCache: test/registered/8-gpu-models/test_deepseek_v32_indexcache.py
  • chat template argument types: test/manual/test_deepseek_chat_templates.py
  • CP and DeepEP-specific changes: use the dedicated CP/DeepEP suites referenced by the PR
  • AMD changes: MI300/MI355 registered lanes
  • NPU changes: Ascend/NPU model deployment and backend tests