SGLang DeepSeek V3/R1 Optimization

Overview

This skill covers the DeepSeek V3/R1 optimization ladder that is active in SGLang main. It intentionally excludes the V3.1 parser delta and the V3.2 DSA/NSA sparse-attention stack, which have separate skills.

Current-main snapshot:

  • SGLang origin/main: 929e00eea on 2026-04-21
  • sgl-cookbook origin/main: 8ec4d03 on 2026-04-21
  • active runtime entry: python/sglang/srt/models/deepseek_v2.py
  • DeepSeek V3/R1 entry class: DeepseekV3ForCausalLM
  • NextN/MTP entry class: DeepseekV3ForCausalLMNextN

The historical evidence lives in:

Before You Change Anything

Record the exact serving shape first (a recording sketch follows this list):

  • model: V3, V3-0324, R1, R1-0528, R1-distill, or vendor quantized checkpoint
  • native FP8, BF16, INT8, AWQ, W4A8/W4AFP8, NVFP4, MXFP4, MXFP8, or LoRA
  • TP / DP / EP / PP / PD topology
  • DP attention enabled or not
  • DeepEP mode: none, normal, low_latency, or auto
  • MoE runner backend: triton, deep_gemm, flashinfer_trtllm, flashinfer_cutlass, flashinfer_cutedsl, cutlass_w4afp8, aiter, or auto
  • MLA attention backend: fa3, flashinfer, flashmla, cutlass_mla, trtllm_mla, aiter, triton, CPU/XPU/NPU fallback
  • MTP enabled or not, and whether the NextN layer is quantized differently from the target model
  • parser pair: --reasoning-parser deepseek-v3 for V3 thinking-style output, --reasoning-parser deepseek-r1 for R1, and --tool-call-parser deepseekv3 for V3/R1 tool calls
  • exact registered/manual test lane and hardware
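
A minimal way to pin this down is to write the shape into a small record before touching any flag. The sketch below is illustrative only; every field value is an assumption about one possible deployment, not a default.

  serving_shape = {
      "model": "deepseek-ai/DeepSeek-R1",      # or V3 / V3-0324 / R1-0528 / distill / vendor quant
      "weights": "native FP8",                 # BF16, INT8, AWQ, W4A8/W4AFP8, NVFP4, MXFP4, MXFP8, LoRA
      "topology": {"tp": 8, "dp": 1, "ep": 1, "pp": 1, "pd": False},
      "dp_attention": False,
      "deepep_mode": "none",                   # none / normal / low_latency / auto
      "moe_runner_backend": "auto",            # triton, deep_gemm, flashinfer_trtllm, ...
      "mla_attention_backend": "fa3",          # flashinfer, flashmla, cutlass_mla, trtllm_mla, ...
      "mtp": {"enabled": False, "nextn_quant_matches_target": None},
      "parsers": {"reasoning": "deepseek-r1", "tool_call": "deepseekv3"},
      "test_lane": "test/registered/8-gpu-models/test_deepseek_v3_basic.py",
      "hardware": "8xH200",
  }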

Core Principle

Do not treat DeepSeek V3/R1 support as a single optimization problem.

  • The V3/R1 base path is an MLA plus MoE throughput problem.
  • R1 adds a reasoning-parser contract and heavier quantized deployment tracks.
  • W4AFP8 is its own loader, kernel, EP, TP, and DeepEP story.
  • MTP is a NextN-model story; target model and draft layer can have different quantization or backend requirements.
  • Shared expert fusion is powerful but topology-sensitive. On current main it is disabled under DeepEP unless --enforce-shared-experts-fusion is set.
  • On Blackwell, server defaults may automatically select trtllm_mla and flashinfer_trtllm; on ROCm, AITER and TileLang paths are separate validation surfaces.

The optimization order matters:

  1. confirm the loader and quant config (see the config check after this list)
  2. confirm MLA backend and KV cache dtype
  3. confirm MoE runner and shared expert behavior
  4. add or validate MTP
  5. add DP attention, EP, PP, PD, or DeepEP only after the single-shape path is correct
  6. harden PCG, deterministic, LoRA, and backend-specific tests
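
For step 1, the quickest sanity check is reading the checkpoint's config.json before launching anything. This is a minimal sketch; the quantization_config layout varies across native FP8, ModelOpt, AWQ, and vendor checkpoints, so treat the printed value as a starting point, not a verdict. The checkpoint path is a placeholder.

  import json
  from pathlib import Path

  ckpt_dir = Path("/models/DeepSeek-R1")  # hypothetical local checkpoint path

  cfg = json.loads((ckpt_dir / "config.json").read_text())

  # The V3/R1 entry class must be selected, not an older DeepseekV2 variant.
  assert "DeepseekV3ForCausalLM" in cfg.get("architectures", [])

  # None means an unquantized (BF16) checkpoint; otherwise the loader keys off quant_method.
  quant = cfg.get("quantization_config")
  print("quant_method:", quant.get("quant_method") if quant else "unquantized")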

Main Runtime Surfaces

Start from these files before changing behavior:

  • python/sglang/srt/models/deepseek_v2.py
  • python/sglang/srt/models/deepseek_nextn.py
  • python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py
  • python/sglang/srt/models/deepseek_common/attention_backend_handler.py
  • python/sglang/srt/models/deepseek_common/attention_forward_methods/
  • python/sglang/srt/layers/attention/flashattention_backend.py
  • python/sglang/srt/layers/radix_attention.py
  • python/sglang/srt/mem_cache/memory_pool.py
  • python/sglang/srt/layers/quantization/fp8_kernel.py
  • python/sglang/srt/layers/quantization/deep_gemm.py
  • python/sglang/compile_deep_gemm.py
  • python/sglang/srt/layers/moe/topk.py
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py
  • python/sglang/srt/layers/moe/cutlass_w4a8_moe.py
  • python/sglang/srt/layers/moe/ep_moe/layer.py
  • python/sglang/srt/layers/moe/token_dispatcher/deepep.py
  • python/sglang/srt/layers/quantization/w4afp8.py
  • python/sglang/srt/layers/rotary_embedding.py
  • python/sglang/srt/server_args.py
  • python/sglang/srt/managers/schedule_batch.py
  • python/sglang/srt/mem_cache/common.py
  • python/sglang/srt/parser/reasoning_parser.py
  • python/sglang/srt/function_call/deepseekv3_detector.py
  • sgl-kernel/csrc/moe/moe_fused_gate.cu
  • sgl-kernel/csrc/moe/moe_align_kernel.cu
  • sgl-kernel/csrc/attention/merge_attn_states.cu
  • sgl-kernel/include/sgl_kernel_ops.h

Open PRs to Track

Check these before declaring a V3/R1 gap:

  • #14194: DCP for deepseek_v2, open.
  • #15315 and #15380: group GEMM work for DeepSeek-R1-W4AFP8, open.
  • #18892: JIT support for DeepSeek V3 GEMM, open.
  • #6011: FlashInfer MLA speculative-decoding custom mask, open.
  • #6738: partial MHA-kernel support in MLA forward when page size is greater than one, open.
  • #7005: FlowMLA zero-overhead DP MLA memory optimization, open.
  • #23336: adaptive speculative-num-steps support for spec v2 EAGLE workers, open.
  • #21526: ROCm AITER router GEMM regression for non-DSR1 MoE, open.
  • #21529: MXFP4 / Quark W4A4 support for DeepSeek architecture on ROCm, open.
  • #21530: ROCm fused MLA decode RoPE fix for DeepSeek variants, open.
  • #21531: migrate dsv3_router_gemm from AOT sgl-kernel to JIT, open.
  • #22268: fix LoRA adapters bypassed in DeepSeek MLA prepare_qkv_latent, open.
  • #22774: MUSA backend support for DeepSeek V2/V3/R1, open.
  • #22938: restore DeepSeek MLA MI300X paths after the MLA refactor, open.
  • #23195: guard .weight access in DeepseekV2AttentionMLA for AWQ/compressed-tensors, open.
  • #23257: double-reduce in DeepseekV2MoE with CuteDSL EP plus DP attention, open.

Known reverted track:

  • #14162 enabled FP8 communication for R1 W4A8 DeepEP low-latency; it was reverted by #21719, then relanded by #22316.

Known exploratory or closed tracks:

  • #5432 introduced a DeepGEMM group_gemm_masked BMM path for MLA FP8 quantization. Treat it as an explored path, not as the default production H200 speed path.
  • #6151 explored hybrid attention backend wiring and closed without becoming the main V3/R1 path.
  • #22950 explored parser-gated two-phase reasoning radix-cache stripping and closed before becoming current support; read #23315 for the merged path.
  • #22933 is current-main CPU shared-expert interface cleanup. It matters for CPU shared-expert parity, not for H200 GPU throughput.

Runtime Addendum

Check these current-main runtime tracks in addition to the original H200 ladder:

  • #21599 adds adaptive speculative_num_steps for EAGLE top-k=1. For V3/R1 MTP, inspect server_args.py, speculative runtime params, and EAGLE worker state before assuming the number of draft steps is static.
  • #22128 allows piecewise CUDA graph to run with speculative decoding. When auditing PCG plus MTP failures, check model_runner.py, piecewise_cuda_graph_runner.py, and the server flag gate instead of treating the combination as unsupported.
  • #23315 adds opt-in stripping of thinking tokens from radix cache. This touches schedule_batch.py, mem_cache/common.py, and server_args.py; it matters for DeepSeek V3/R1 reasoning-parser cache reuse, especially when thinking tokens should not become a reusable prefix.
  • #22950 is the closed predecessor for that reasoning-cache behavior, and #23336 is the open spec-v2 extension for adaptive speculative decoding.

H200 Single-Node Optimization Findings

The single-node H200 notes describe a March to May 2025 optimization ladder. A complete V3/R1 audit must include those PRs, because many later abstractions obscure the original performance rationale.

Required H200 PR coverage:

When updating this skill, explicitly mark whether an H200 optimization is still current-main default, current-main optional, hardware-specific, or only an explored/closed path.

Evolution Path

Stage V3R1-H200: Single-node H200 performance ladder

The H200 ladder is the missing context behind many later V3/R1 defaults.

  • FP8 Block GEMM evolved from Triton/Cutlass experiments into DeepGEMM-backed paths. Current main exposes DeepGEMM through fp8_kernel.py, deep_gemm.py, and compile_deep_gemm.py; Hopper/Blackwell defaults should be checked with SGLANG_ENABLE_JIT_DEEPGEMM.
  • Fused MoE acceleration starts before the current fused_topk_deepseek abstraction. moe_fused_gate.cu, moe_align_kernel.cu, per_token_group_quant_8bit, routed scaling fusion, and shared-expert fusion all belong to this stage.
  • MLA backend selection was built across FlashMLA, Cutlass MLA, FA3 MLA, and later MHA-chunked prefill paths. Do not conclude that a backend is current based only on an early support PR; check server_args.py, attention_backend_handler.py, and docs/basic_usage/deepseek_v3.md.
  • Small model-file optimizations matter: DeepSeek CUDA RoPE, removing forward_absorb copies, fusing q_a_proj with kv_a_proj_with_mqa, fusing MLA KV-cache writes, overlapping q/k norm, and removing scalar/allocator overhead all live on the hot path.
  • The closed hybrid-attention PR #6151 should be cited as non-mainline context, not as shipped V3/R1 support.
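
As an illustration of the small hot-path fusions above, the q_a_proj / kv_a_proj_with_mqa fusion amounts to concatenating two down-projection weights so a single GEMM produces both latents. This is a conceptual torch sketch with toy shapes, not the SGLang model or kernel code.

  import torch

  # Toy dims; DeepSeek V3 uses hidden=7168, q_lora_rank=1536, kv_lora_rank=512, rope=64.
  hidden, q_lora_rank, kv_lora_rank, rope_dim = 256, 64, 32, 16

  w_q_a = torch.randn(q_lora_rank, hidden)                # q_a_proj weight
  w_kv_a = torch.randn(kv_lora_rank + rope_dim, hidden)   # kv_a_proj_with_mqa weight

  x = torch.randn(4, hidden)  # 4 tokens of hidden states

  # Unfused path: two GEMMs over the same activation.
  q_a_ref, kv_a_ref = x @ w_q_a.T, x @ w_kv_a.T

  # Fused path: one GEMM over the concatenated weight, then a split.
  w_fused = torch.cat([w_q_a, w_kv_a], dim=0)
  q_a, kv_a = (x @ w_fused.T).split([q_lora_rank, kv_lora_rank + rope_dim], dim=-1)

  assert torch.allclose(q_a, q_a_ref, atol=1e-5) and torch.allclose(kv_a, kv_a_ref, atol=1e-5)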

Success check:

  • every H200-note PR is either in the timeline, in a closed/exploratory note, or explicitly marked out of scope
  • current source paths have replaced historical file names where refactors moved the code
  • DeepGEMM, FA3/FlashMLA/Cutlass MLA, fused MoE, shared experts, and MTP interactions are checked together

Stage V3R1-0: Basic V3/R1 support is not enough

Early DeepSeek support can launch, but the optimized path needs hardware-specific MLA, MoE, and quant handling.

Success check:

  • DeepseekV3ForCausalLM is selected
  • the loader recognizes the quant config
  • launch docs and current server defaults agree

Stage V3R1-1: MLA backend and FP8 correctness

DeepSeek V3/R1 performance depends on MLA path selection, weight absorption, KV-cache dtype, DeepGEMM, and backend fallback.

  • block-wise FP8 and BMM paths feed the MLA absorbed path
  • server_args.py selects trtllm_mla on SM100 when no attention backend is set
  • deepseek_weight_loader.py requantizes or dequantizes kv_b_proj according to quant format
  • deterministic mode forces supported DeepSeek attention backends
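
A minimal sketch of the absorbed-decode idea behind these bullets: instead of expanding the compressed KV cache per head, kv_b_proj is split into w_kc and w_vc and folded into the query and output sides, so attention runs directly in the kv_lora_rank latent space. Shapes are toy values, and the real kernels fuse RoPE application, YaRN-adjusted scaling, quantization, and paging; treat this as math intuition only.

  import torch

  heads, d_nope, d_rope, d_v, kv_rank, seq = 4, 32, 16, 32, 64, 10

  # Per-head slices of kv_b_proj after the loader splits it.
  w_kc = torch.randn(heads, d_nope, kv_rank)   # absorbs the "no-PE" query side
  w_vc = torch.randn(heads, kv_rank, d_v)      # absorbs the output side

  # Compressed KV cache: one kv_lora_rank latent plus a shared rope key per token.
  c_kv = torch.randn(seq, kv_rank)
  k_rope = torch.randn(seq, d_rope)

  # One decode step.
  q_nope = torch.randn(heads, d_nope)
  q_rope = torch.randn(heads, d_rope)

  # Absorb: project the query into latent space once per head.
  q_latent = torch.einsum("hd,hdk->hk", q_nope, w_kc)        # [heads, kv_rank]

  scale = (d_nope + d_rope) ** -0.5
  scores = (q_latent @ c_kv.T + q_rope @ k_rope.T) * scale   # [heads, seq]
  probs = scores.softmax(dim=-1)

  out_latent = probs @ c_kv                                  # [heads, kv_rank]
  out = torch.einsum("hk,hkd->hd", out_latent, w_vc)         # [heads, d_v]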

Key PRs:

Stage V3R1-2: MoE routing and shared experts

The main model has 256 routed experts plus one shared expert. Current main can remap mlp.shared_experts into expert slot 256 when shared-expert fusion is active.

  • DeepseekV2MoE computes num_fused_shared_experts
  • TopK is configured with grouped top-k, correction bias, routed scaling, and optional fused shared expert slots
  • determine_num_fused_shared_experts() disables fusion for incompatible shapes, W4AFP8 shared/routed mismatch, TBO/SBO, or DeepEP unless explicitly enforced
  • DeepEP fusion expands the local expert layout from 256 routed experts to 256 + EP_size slots
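
A reference-style sketch of the grouped routing these bullets describe, following the published DeepSeek-V3 recipe: sigmoid scores, a per-expert correction bias used only for expert choice, group pruning, renormalization, and routed scaling. Constants and tensor names are illustrative; the production path runs through fused kernels such as moe_fused_gate.cu rather than this torch code.

  import torch

  tokens, n_experts, n_group, topk_group, top_k = 3, 256, 8, 4, 8
  routed_scaling_factor = 2.5

  logits = torch.randn(tokens, n_experts)
  bias = torch.randn(n_experts)                     # e_score_correction_bias

  scores = logits.sigmoid()
  choice = scores + bias                            # bias only affects which experts win

  # Group pruning: score each group by its top-2 experts, keep the best groups.
  group_scores = choice.view(tokens, n_group, -1).topk(2, dim=-1).values.sum(-1)
  keep_groups = group_scores.topk(topk_group, dim=-1).indices
  group_mask = torch.zeros(tokens, n_group).scatter(1, keep_groups, 1.0)
  expert_mask = group_mask.repeat_interleave(n_experts // n_group, dim=1).bool()

  # Top-k experts inside the surviving groups.
  topk_idx = choice.masked_fill(~expert_mask, float("-inf")).topk(top_k, dim=-1).indices

  # Weights come from the unbiased scores, renormalized and scaled.
  topk_w = scores.gather(1, topk_idx)
  topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True) * routed_scaling_factor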

Key PRs:

Stage V3R1-3: MTP and NextN

DeepSeek V3/R1 MTP uses the NextN model as an EAGLE draft path.

  • target model: DeepseekV3ForCausalLM
  • draft model: DeepseekV3ForCausalLMNextN
  • deepseek_nextn.py handles the single NextN layer, shared head/embed reuse, quant override, and AMD R1 MXFP4 naming
  • current main allows quantized target and BF16 or differently quantized NextN layers, but validate the exact backend pair
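
A hedged launch sketch for enabling MTP through the EAGLE speculative path. Flag values (step counts, draft-token counts, model name) are illustrative assumptions; some deployments also pass a separate NextN checkpoint via --speculative-draft-model-path. Confirm the current flag names and defaults in server_args.py and the DeepSeek docs, and validate quantized-target plus differently quantized NextN pairs explicitly.

  import subprocess, sys

  cmd = [
      sys.executable, "-m", "sglang.launch_server",
      "--model-path", "deepseek-ai/DeepSeek-R1",   # target: DeepseekV3ForCausalLM
      "--tp", "8",
      "--trust-remote-code",
      "--reasoning-parser", "deepseek-r1",
      "--speculative-algorithm", "EAGLE",          # the NextN draft rides the EAGLE path
      "--speculative-num-steps", "3",              # illustrative; see #21599 / #23336
      "--speculative-eagle-topk", "1",
      "--speculative-num-draft-tokens", "4",
  ]
  subprocess.run(cmd, check=True)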

Key PRs:

Validation:

  • test/registered/8-gpu-models/test_deepseek_v3_mtp.py
  • test/registered/amd/test_deepseek_v3_mtp.py
  • test/registered/spec/eagle/test_deepseek_v3_fp4_mtp_small.py

Stage V3R1-4: R1 W4AFP8 and DeepEP

R1 W4AFP8 is a separate ladder from native FP8.

  • W4AFp8Config detects mixed precision and maps linear layers to FP8 while MoE experts use W4A8
  • cutlass_w4a8_moe.py handles packed int4 expert weights and FP8 activations
  • EP support maps global experts to local partitions
  • normal DeepEP uses dispatch output metadata and apply_deepep_normal
  • low-latency DeepEP has special communication behavior and must be checked against the revert/reland sequence
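
A toy sketch of the mixed-precision split described above: W4AFP8-style configs route MoE expert weights to the int4-weight / FP8-activation path while dense linears stay FP8. This is not the real W4AFp8Config logic, only the shape of the decision; the hypothetical helper below keys off module names the way DeepSeek checkpoints name their experts.

  # Hypothetical illustration of the precision split, not SGLang's loader code.
  def pick_quant_path(weight_name: str) -> str:
      """Return which kernel family a weight would use under a W4AFP8 deployment."""
      if ".mlp.experts." in weight_name:
          return "w4a8"        # packed int4 expert weights, FP8 activations (cutlass_w4a8_moe)
      if ".self_attn." in weight_name or ".shared_experts." in weight_name:
          return "fp8"         # dense linears stay block-wise FP8
      return "unquantized"     # norms, embeddings, lm_head

  for name in [
      "model.layers.10.mlp.experts.37.gate_proj.weight",
      "model.layers.10.mlp.shared_experts.gate_proj.weight",
      "model.layers.10.self_attn.q_a_proj.weight",
      "model.layers.10.input_layernorm.weight",
  ]:
      print(name, "->", pick_quant_path(name))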

Key PRs:

Stage V3R1-5: Quantized backend coverage

Treat each quantization format as a separate loader and backend contract.

  • W4AFP8: w4afp8.py, cutlass_w4a8_moe.py, DeepEP and TP variants
  • NVFP4 / ModelOpt FP4: server defaults and Blackwell backend selection
  • MXFP4 / Quark: AMD R1 and open ROCm DeepSeek-architecture work
  • MXFP8: FlashInfer TRTLLM routed MoE support
  • LoRA: quant-info refactor and DeepSeek MLA LoRA support

Key PRs:

Stage V3R1-6: Distributed and backend hardening

The late-stage failures are usually topology bugs rather than model-architecture bugs.

  • DP attention plus torch.compile GPU faults
  • BF16 KV accuracy under DP
  • AITER correction-bias dtype conversion
  • XPU and MUSA backend compatibility
  • OOM during weight loading
  • PCG and deterministic test cleanup

Key PRs:

Validation Surface

Use the narrowest lane that matches the change:

  • base V3: test/registered/8-gpu-models/test_deepseek_v3_basic.py
  • V3 MTP: test/registered/8-gpu-models/test_deepseek_v3_mtp.py
  • AMD V3: test/registered/amd/test_deepseek_v3_basic.py
  • AMD V3 KV FP8: test/registered/amd/test_deepseek_v3_basic_kv_fp8.py
  • R1 MXFP4: test/registered/amd/test_deepseek_r1_mxfp4_8gpu.py
  • R1 FP8 TRTLLM backend: test/registered/backends/test_deepseek_r1_fp8_trtllm_backend.py
  • FP4: test/registered/quant/test_deepseek_v3_fp4_4gpu.py
  • W4A8: test/registered/quant/test_w4a8_deepseek_v3.py
  • MLA: test/registered/mla/test_mla_deepseek_v3.py
  • INT8 MLA: test/registered/mla/test_mla_int8_deepseek_v3.py
  • LoRA: test/registered/lora/test_lora_deepseek_v3_base_logprob_diff.py
  • router/top-k kernel: test/registered/kernels/test_fused_topk_deepseek.py
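
Once the lane is chosen, run only that file. A minimal sketch, assuming the registered tests keep their standard unittest-style entry points and that the hardware for that lane is available; the repo path is a placeholder.

  import subprocess, sys

  lane = "test/registered/mla/test_mla_deepseek_v3.py"  # narrowest match for an MLA-only change
  subprocess.run([sys.executable, lane], check=True, cwd="/path/to/sglang")  # hypothetical repo root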