sglang-deepseek-v31-optimization
SGLang DeepSeek V3.1 Optimization
Overview
This skill covers the DeepSeek V3.1 support and optimization ladder that is active in SGLang main. V3.1 shares the DeepSeek V3/R1 model backbone, but its tool-call format, chat template, thinking flag, streaming parser, and validation lanes are separate enough to require an independent skill.
Current-main snapshot:
- SGLang origin/main: 929e00eea on 2026-04-21
- sgl-cookbook origin/main: 8ec4d03 on 2026-04-21
- core runtime entry: python/sglang/srt/models/deepseek_v2.py
- V3.1 tool parser: python/sglang/srt/function_call/deepseekv31_detector.py
- V3.1 tool template: examples/chat_template/tool_chat_template_deepseekv31.jinja
The historical evidence lives in:
- references/pr-history.md: chronological PR evidence and code-level notes
- references/playbook.md: investigation order, symptom mapping, validation commands
Before You Change Anything
Record the exact serving shape first:
- model: deepseek-ai/DeepSeek-V3.1, DeepSeek-V3.1-Terminus, or DeepSeek-V3.1-Speciale
- whether thinking mode is enabled with chat_template_kwargs.thinking
- whether tool calling is enabled with --tool-call-parser deepseekv31
- whether the V3.1 tool template is used: examples/chat_template/tool_chat_template_deepseekv31.jinja
- whether the reasoning parser is set with --reasoning-parser deepseek-v3
- TP / DP / EP / PP topology
- MTP enabled or not
- backend and quantization inherited from the DeepSeek V3/R1 backbone
- streaming or non-streaming OpenAI API path
- exact test lane: manual, nightly, parser unit, chat-template unit, or model-backed accuracy/perf
Core Principle
Do not debug V3.1 as ordinary V3.
- The model class is shared with V3/R1, so MLA, MoE, quantization, DeepEP, and MTP bugs usually belong to sglang-deepseek-v3-r1-optimization.
- The user-visible V3.1 delta is parser and template behavior: hybrid thinking, tool calling, streaming deltas, and structural-tag constrained decoding.
- V3.1 tool calls do not use V3's literal function marker or fenced JSON block.
- V3.1 uses chat_template_kwargs: {"thinking": true} with --reasoning-parser deepseek-v3, not R1's parser.
- DeepSeek-V3.1-Speciale should not be treated as a tool-calling target.
The optimization order matters:
- confirm the parser/template contract
- confirm streaming and non-streaming parity
- confirm thinking mode and </think> template behavior
- confirm model loading and MTP inherited from V3/R1
- only then tune MoE/backend performance
- add CPU parser tests for parser-only changes before running model-backed tests
Main Runtime Surfaces
Start from these files before changing behavior:
- python/sglang/srt/function_call/deepseekv31_detector.py
- python/sglang/srt/function_call/function_call_parser.py
- examples/chat_template/tool_chat_template_deepseekv31.jinja
- python/sglang/srt/entrypoints/openai/serving_chat.py
- python/sglang/srt/parser/reasoning_parser.py
- python/sglang/srt/managers/schedule_batch.py
- python/sglang/srt/mem_cache/common.py
- python/sglang/srt/models/deepseek_v2.py
- python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py
- test/manual/test_deepseek_v31.py
- test/manual/nightly/test_deepseek_v31_perf.py
- test/manual/test_deepseek_chat_templates.py
Open PRs to Track
Check these before declaring a V3.1 gap:
- #17761: missing Assistant token after tool output in V3.1/V3.2 chat templates, open.
- #18236: function-call arguments missing in V3.1 streaming mode, open.
- #21739: NPU deployment docs for V3.1/V3.2, open.
- #22433: CPU unit tests for DeepSeekV31Detector, open.
- #22981: broader CPU tests for missing function-call detectors, open.
- #23336: adaptive speculative-num-steps support for spec v2 EAGLE workers, open and relevant when V3.1 MTP runs with spec v2.
Additional PR Coverage
PR coverage across all states (open, merged, and closed) also includes parser/template PRs that are relevant to V3.1 even though they are not the core bring-up PRs:
- #9468, #10875, and #11189 refine reasoning/thinking docs, request handling, and eval flags.
- #9895 and #14837 update tool_chat_template_deepseekv31.jinja.
- #10550, #11223, #11589, and #21593 are general tool-choice / constrained-decoding / parser fixes that affect V3.1 serving behavior.
- #17141, #17320, and #17558 are closed attempts around the missing-Assistant-token-after-tool-output issue that remains tracked by open #17761.
- #22950 closed the first parser-gated reasoning radix-cache stripping attempt, while merged #23315 adds the current opt-in strip of thinking tokens from radix cache. For V3.1 thinking mode, this is cache-behavior work rather than a new tool parser.
- #21599 and #22128 are inherited MTP/speculative-decoding infrastructure updates: adaptive EAGLE draft steps and piecewise CUDA graph with speculative decoding.
Evolution Path
Stage V31-0: Split V3.1 tool calling from V3
DeepSeek V3.1 has a distinct tool-call wire format:
<|tool▁calls▁begin|><|tool▁call▁begin|>{name}<|tool▁sep|>{json_args}<|tool▁call▁end|><|tool▁calls▁end|>
It does not use V3's function literal or fenced JSON block.
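As a concrete illustration, here is a minimal, self-contained sketch of splitting this wire format. It is not the actual DeepSeekV31Detector implementation; the regex approach and the sample tool names (get_weather, get_time) are illustrative only.

```python
# Minimal sketch of splitting the V3.1 tool-call wire format shown above.
# Illustrative only: not the actual DeepSeekV31Detector implementation.
import json
import re

TOOL_CALL_RE = re.compile(
    r"<\|tool▁call▁begin\|>(?P<name>.+?)<\|tool▁sep\|>(?P<args>.+?)<\|tool▁call▁end\|>",
    re.DOTALL,
)

def parse_tool_calls(text: str) -> list[dict]:
    """Extract {name, arguments} dicts from a completed V3.1 response."""
    return [
        {"name": m.group("name"), "arguments": json.loads(m.group("args"))}
        for m in TOOL_CALL_RE.finditer(text)
    ]

# Two calls chained back-to-back with no separator between them.
sample = (
    "<|tool▁calls▁begin|>"
    '<|tool▁call▁begin|>get_weather<|tool▁sep|>{"city": "Paris"}<|tool▁call▁end|>'
    '<|tool▁call▁begin|>get_time<|tool▁sep|>{"tz": "CET"}<|tool▁call▁end|>'
    "<|tool▁calls▁end|>"
)
assert [c["name"] for c in parse_tool_calls(sample)] == ["get_weather", "get_time"]
```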
Key PR:
Success check:
- --tool-call-parser deepseekv31 resolves to DeepSeekV31Detector
- the template emits V3.1 markers, not V3 markers
- multiple tool calls can be chained without separators
Stage V31-1: Enable hybrid thinking correctly
V3.1 thinking mode is toggled through the chat template, not through the R1 parser.
Key PR:
Rules:
- launch with --reasoning-parser deepseek-v3
- send extra_body: {"chat_template_kwargs": {"thinking": true}}
- the template emits <think> when thinking is enabled and </think> when non-thinking is desired
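A minimal request sketch against an OpenAI-compatible SGLang endpoint, assuming a server launched with --reasoning-parser deepseek-v3 on localhost:30000. The base URL, API key, model name, and the presence of a reasoning_content field are deployment- and version-dependent assumptions.

```python
# Sketch: toggle V3.1 hybrid thinking via chat_template_kwargs.
# Assumes an SGLang server on localhost:30000 launched with
# --reasoning-parser deepseek-v3; adjust URL and model name as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    extra_body={"chat_template_kwargs": {"thinking": True}},
)

msg = resp.choices[0].message
# With the reasoning parser active, thinking output is separated from the
# final answer; the exact field name depends on server/client versions.
print(getattr(msg, "reasoning_content", None))
print(msg.content)
```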
Stage V31-2: Keep chat template argument types stable
V3.1 multi-turn tool calls can pass tool["function"]["arguments"] as either a dict or an already serialized JSON string. The template must not double-escape JSON strings.
Key PR:
Success check:
- dict arguments are rendered through tojson
- string arguments are used as-is
- mixed multi-tool calls keep both forms intact
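The rule the template must implement is easiest to see outside Jinja. This hypothetical Python helper (render_tool_arguments is not a real SGLang function) mirrors the dict-vs-string contract:

```python
import json

def render_tool_arguments(arguments):
    """Hypothetical mirror of the template's argument handling:
    dicts are serialized exactly once, strings pass through untouched."""
    if isinstance(arguments, str):
        return arguments  # already-serialized JSON: use as-is, never re-escape
    return json.dumps(arguments, ensure_ascii=False)  # dict: serialize once

assert render_tool_arguments({"city": "Paris"}) == '{"city": "Paris"}'
assert render_tool_arguments('{"city": "Paris"}') == '{"city": "Paris"}'
```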
Stage V31-3: Harden structural tags and streaming deltas
The structural trigger should be the generic per-call begin token, not the full name-specific begin string. Streaming must also preserve arguments that arrive in the first chunk and normal text before the tool marker.
Key PRs:
Success check:
- structure_info().trigger is <|tool▁call▁begin|>
- streaming emits the function name and the first argument diff when they appear together
- normal text before the first tool marker is not dropped
- CPU parser tests cover invalid JSON, unknown tools, unicode args, multiple calls, and streaming chunks
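To make the streaming contract concrete, here is a self-contained sketch, not the real detector: it buffers deltas, holds back a possible partial trigger so plain text is never dropped, and shows the name plus first argument diff arriving in one chunk. The StreamSketch class and chunk boundaries are illustrative assumptions.

```python
# Self-contained sketch of the streaming contract above; not the real
# DeepSeekV31Detector. Chunk boundaries below are illustrative.
TRIGGER = "<|tool▁call▁begin|>"

class StreamSketch:
    def __init__(self):
        self.buffer = ""
        self.in_tool_call = False

    def feed(self, delta: str):
        """Return (normal_text, tool_fragment) for one streamed delta."""
        self.buffer += delta
        if self.in_tool_call:
            frag, self.buffer = self.buffer, ""
            return "", frag
        idx = self.buffer.find(TRIGGER)
        if idx == -1:
            # Hold back a tail that could still become a partial trigger,
            # but emit everything before it so normal text is never dropped.
            safe = max(0, len(self.buffer) - len(TRIGGER) + 1)
            text, self.buffer = self.buffer[:safe], self.buffer[safe:]
            return text, ""
        text, frag = self.buffer[:idx], self.buffer[idx + len(TRIGGER):]
        self.buffer, self.in_tool_call = "", True
        return text, frag

s = StreamSketch()
for delta in ["Let me check",
              ' that.<|tool▁call▁begin|>get_weather<|tool▁sep|>{"ci',
              'ty": "Paris"}']:
    print(s.feed(delta))
# ('', '')
# ('Let me check that.', 'get_weather<|tool▁sep|>{"ci')
# ('', 'ty": "Paris"}')
```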
Stage V31-4: Keep inherited model loading and MTP healthy
V3.1 still depends on DeepSeek V3/R1 loader, MLA, MoE, and MTP surfaces.
Key PRs:
Success check:
- test/manual/test_deepseek_v31.py covers TP8 and TP8+MTP
- nightly perf no longer carries stale enable_dp_attention flags
- loading fixes are checked in the shared DeepSeek V2/V3 loader and in parser code
Stage V31-5: Tune MoE configs as a DeepSeek-family shape
V3.1 shares the DeepSeek MoE shape with V3/V3.2-style fused MoE config work.
Key PR:
Success check:
- H20 and H20-3E configs exist for E=257, N=256, fp8_w8a8
- config selection is validated on the same hardware lane that motivated the tuning
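For reference, tuned fused-MoE configs are keyed by a filename convention of roughly this shape. The moe_config_filename helper below is a hedged sketch, not the real lookup function, and the device-name strings are assumptions; the actual helper and device-name normalization live in SGLang's fused-MoE tuning code.

```python
# Hedged sketch of the tuned fused-MoE config filename convention; the real
# lookup helper and device-name normalization live in SGLang's fused-MoE code.
def moe_config_filename(E: int, N: int, device_name: str, dtype: str) -> str:
    return f"E={E},N={N},device_name={device_name},dtype={dtype}.json"

# Expected shape for the configs this stage adds (device string assumed):
print(moe_config_filename(257, 256, "NVIDIA_H20", "fp8_w8a8"))
# E=257,N=256,device_name=NVIDIA_H20,dtype=fp8_w8a8.json
```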
Validation Surface
Use the narrowest lane that matches the change:
- parser-only: open #22433 test pattern, or equivalent DeepSeekV31Detector CPU unit tests
- template-only: test/manual/test_deepseek_chat_templates.py
- V3.1 base/MTP: test/manual/test_deepseek_v31.py
- nightly performance: test/manual/nightly/test_deepseek_v31_perf.py
- inherited MLA/MoE/quant bug: switch to sglang-deepseek-v3-r1-optimization