sglang-deepseek-v31-optimization
SGLang DeepSeek V3.1 Optimization
Overview
This skill covers the DeepSeek V3.1 support and optimization ladder that is active in SGLang main. V3.1 shares the DeepSeek V3/R1 model backbone, but its tool-call format, chat template, thinking flag, streaming parser, and validation lanes are separate enough to require an independent skill.
Current-main snapshot:
- SGLang origin/main: 929e00eea on 2026-04-21
- sgl-cookbook origin/main: 8ec4d03 on 2026-04-21
- core runtime entry: python/sglang/srt/models/deepseek_v2.py
- V3.1 tool parser: python/sglang/srt/function_call/deepseekv31_detector.py
- V3.1 tool template: examples/chat_template/tool_chat_template_deepseekv31.jinja
The historical evidence lives in:
- references/pr-history.md: chronological PR evidence and code-level notes
- references/playbook.md: investigation order, symptom mapping, validation commands
Before You Change Anything
Record the exact serving shape first:
- model: deepseek-ai/DeepSeek-V3.1, DeepSeek-V3.1-Terminus, or DeepSeek-V3.1-Speciale
- whether thinking mode is enabled with chat_template_kwargs.thinking
- whether tool calling is enabled with --tool-call-parser deepseekv31
- whether the V3.1 tool template is used: examples/chat_template/tool_chat_template_deepseekv31.jinja
- whether the reasoning parser is set with --reasoning-parser deepseek-v3
- TP / DP / EP / PP topology
- MTP enabled or not
- backend and quantization inherited from the DeepSeek V3/R1 backbone
- streaming or non-streaming OpenAI API path
- exact test lane: manual, nightly, parser unit, chat-template unit, or model-backed accuracy/perf
Core Principle
Do not debug V3.1 as ordinary V3.
- The model class is shared with V3/R1, so MLA, MoE, quantization, DeepEP, and MTP bugs usually belong to sglang-deepseek-v3-r1-optimization.
- The user-visible V3.1 delta is parser and template behavior: hybrid thinking, tool calling, streaming deltas, and structural-tag constrained decoding.
- V3.1 tool calls do not use V3's literal function marker or fenced JSON block.
- V3.1 uses chat_template_kwargs: {"thinking": true} with --reasoning-parser deepseek-v3, not R1's parser.
- DeepSeek-V3.1-Speciale should not be treated as a tool-calling target.
The optimization order matters:
- confirm the parser/template contract
- confirm streaming and non-streaming parity
- confirm thinking mode and </think> template behavior
- confirm model loading and MTP inherited from V3/R1
- only then tune MoE/backend performance
- add CPU parser tests for parser-only changes before running model-backed tests
Main Runtime Surfaces
Start from these files before changing behavior:
- python/sglang/srt/function_call/deepseekv31_detector.py
- python/sglang/srt/function_call/function_call_parser.py
- examples/chat_template/tool_chat_template_deepseekv31.jinja
- python/sglang/srt/entrypoints/openai/serving_chat.py
- python/sglang/srt/parser/reasoning_parser.py
- python/sglang/srt/managers/schedule_batch.py
- python/sglang/srt/mem_cache/common.py
- python/sglang/srt/models/deepseek_v2.py
- python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py
- test/manual/test_deepseek_v31.py
- test/manual/nightly/test_deepseek_v31_perf.py
- test/manual/test_deepseek_chat_templates.py
Open PRs to Track
Check these before declaring a V3.1 gap:
- #17761: missing Assistant token after tool output in V3.1/V3.2 chat templates, open.
- #18236: function-call arguments missing in V3.1 streaming mode, open.
- #21739: NPU deployment docs for V3.1/V3.2, open.
- #22433: CPU unit tests for DeepSeekV31Detector, open.
- #22981: broader CPU tests for missing function-call detectors, open.
- #23336: adaptive speculative-num-steps support for spec v2 EAGLE workers, open and relevant when V3.1 MTP runs with spec v2.
Additional PR Coverage
PR coverage across all states (open, merged, and closed) also includes parser/template PRs that are relevant to V3.1 even though they are not the core bring-up PRs:
- #9468, #10875, and #11189 refine reasoning/thinking docs, request handling, and eval flags.
- #9895 and #14837 update tool_chat_template_deepseekv31.jinja.
- #10550, #11223, #11589, and #21593 are general tool-choice / constrained-decoding / parser fixes that affect V3.1 serving behavior.
- #17141, #17320, and #17558 are closed attempts around the missing-Assistant-token-after-tool-output issue that remains tracked by open #17761.
- #22950 closed the first parser-gated reasoning radix-cache stripping attempt, while merged #23315 adds the current opt-in strip of thinking tokens from radix cache. For V3.1 thinking mode, this is cache-behavior work rather than a new tool parser.
- #21599 and #22128 are inherited MTP/speculative-decoding infrastructure updates: adaptive EAGLE draft steps and piecewise CUDA graph with speculative decoding.
Evolution Path
Stage V31-0: Split V3.1 tool calling from V3
DeepSeek V3.1 has a distinct tool-call wire format:
<|tool▁calls▁begin|><|tool▁call▁begin|>{name}<|tool▁sep|>{json_args}<|tool▁call▁end|><|tool▁calls▁end|>
It does not use V3's function literal or fenced JSON block.
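As a concrete illustration, here is a minimal, self-contained sketch of splitting this wire format. It is not the actual DeepSeekV31Detector implementation; the regex approach and the sample tool names (get_weather, get_time) are illustrative only.

```python
# Minimal sketch of splitting the V3.1 tool-call wire format shown above.
# Illustrative only: not the actual DeepSeekV31Detector implementation.
import json
import re

TOOL_CALL_RE = re.compile(
    r"<\|tool▁call▁begin\|>(?P<name>.+?)<\|tool▁sep\|>(?P<args>.+?)<\|tool▁call▁end\|>",
    re.DOTALL,
)

def parse_tool_calls(text: str) -> list[dict]:
    """Extract {name, arguments} dicts from a completed V3.1 response."""
    return [
        {"name": m.group("name"), "arguments": json.loads(m.group("args"))}
        for m in TOOL_CALL_RE.finditer(text)
    ]

# Two calls chained back-to-back with no separator between them.
sample = (
    "<|tool▁calls▁begin|>"
    '<|tool▁call▁begin|>get_weather<|tool▁sep|>{"city": "Paris"}<|tool▁call▁end|>'
    '<|tool▁call▁begin|>get_time<|tool▁sep|>{"tz": "CET"}<|tool▁call▁end|>'
    "<|tool▁calls▁end|>"
)
assert [c["name"] for c in parse_tool_calls(sample)] == ["get_weather", "get_time"]
```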
Key PR:
Success check:
- --tool-call-parser deepseekv31 resolves to DeepSeekV31Detector
- the template emits V3.1 markers, not V3 markers
- multiple tool calls can be chained without separators
Stage V31-1: Enable hybrid thinking correctly
V3.1 thinking mode is toggled through the chat template, not through the R1 parser.
Key PR:
Rules:
- launch with --reasoning-parser deepseek-v3
- send extra_body: {"chat_template_kwargs": {"thinking": true}}
- the template emits <think> when thinking is enabled and </think> when non-thinking is desired
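A minimal request sketch against an OpenAI-compatible SGLang endpoint, assuming a server launched with --reasoning-parser deepseek-v3 on localhost:30000. The base URL, API key, model name, and the presence of a reasoning_content field are deployment- and version-dependent assumptions.

```python
# Sketch: toggle V3.1 hybrid thinking via chat_template_kwargs.
# Assumes an SGLang server on localhost:30000 launched with
# --reasoning-parser deepseek-v3; adjust URL and model name as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    extra_body={"chat_template_kwargs": {"thinking": True}},
)

msg = resp.choices[0].message
# With the reasoning parser active, thinking output is separated from the
# final answer; the exact field name depends on server/client versions.
print(getattr(msg, "reasoning_content", None))
print(msg.content)
```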
Stage V31-2: Keep chat template argument types stable
V3.1 multi-turn tool calls can pass tool["function"]["arguments"] as either a dict or an already serialized JSON string. The template must not double-escape JSON strings.
Key PR:
Success check:
- dict arguments are rendered through tojson
- string arguments are used as-is
- mixed multi-tool calls keep both forms intact
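The rule the template must implement is easiest to see outside Jinja. This hypothetical Python helper (render_tool_arguments is not a real SGLang function) mirrors the dict-vs-string contract:

```python
import json

def render_tool_arguments(arguments):
    """Hypothetical mirror of the template's argument handling:
    dicts are serialized exactly once, strings pass through untouched."""
    if isinstance(arguments, str):
        return arguments  # already-serialized JSON: use as-is, never re-escape
    return json.dumps(arguments, ensure_ascii=False)  # dict: serialize once

assert render_tool_arguments({"city": "Paris"}) == '{"city": "Paris"}'
assert render_tool_arguments('{"city": "Paris"}') == '{"city": "Paris"}'
```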
Stage V31-3: Harden structural tags and streaming deltas
The structural trigger should be the generic per-call begin token, not the full name-specific begin string. Streaming must also preserve arguments that arrive in the first chunk and normal text before the tool marker.
Key PRs:
Success check:
- structure_info().trigger is <|tool▁call▁begin|>
- streaming emits the function name and the first argument diff when they appear together
- normal text before the first tool marker is not dropped
- CPU parser tests cover invalid JSON, unknown tools, unicode args, multiple calls, and streaming chunks
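To make the streaming contract concrete, here is a self-contained sketch, not the real detector: it buffers deltas, holds back a possible partial trigger so plain text is never dropped, and shows the name plus first argument diff arriving in one chunk. The StreamSketch class and chunk boundaries are illustrative assumptions.

```python
# Self-contained sketch of the streaming contract above; not the real
# DeepSeekV31Detector. Chunk boundaries below are illustrative.
TRIGGER = "<|tool▁call▁begin|>"

class StreamSketch:
    def __init__(self):
        self.buffer = ""
        self.in_tool_call = False

    def feed(self, delta: str):
        """Return (normal_text, tool_fragment) for one streamed delta."""
        self.buffer += delta
        if self.in_tool_call:
            frag, self.buffer = self.buffer, ""
            return "", frag
        idx = self.buffer.find(TRIGGER)
        if idx == -1:
            # Hold back a tail that could still become a partial trigger,
            # but emit everything before it so normal text is never dropped.
            safe = max(0, len(self.buffer) - len(TRIGGER) + 1)
            text, self.buffer = self.buffer[:safe], self.buffer[safe:]
            return text, ""
        text, frag = self.buffer[:idx], self.buffer[idx + len(TRIGGER):]
        self.buffer, self.in_tool_call = "", True
        return text, frag

s = StreamSketch()
for delta in ["Let me check",
              ' that.<|tool▁call▁begin|>get_weather<|tool▁sep|>{"ci',
              'ty": "Paris"}']:
    print(s.feed(delta))
# ('', '')
# ('Let me check that.', 'get_weather<|tool▁sep|>{"ci')
# ('', 'ty": "Paris"}')
```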
Stage V31-4: Keep inherited model loading and MTP healthy
V3.1 still depends on DeepSeek V3/R1 loader, MLA, MoE, and MTP surfaces.
Key PRs:
Success check:
- test/manual/test_deepseek_v31.py covers TP8 and TP8+MTP
- nightly perf no longer carries stale enable_dp_attention flags
- loading fixes are checked in the shared DeepSeek V2/V3 loader and in parser code
Stage V31-5: Tune MoE configs as a DeepSeek-family shape
V3.1 shares the DeepSeek MoE shape with V3/V3.2-style fused MoE config work.
Key PR:
Success check:
- H20 and H20-3E configs exist for E=257, N=256, fp8_w8a8
- config selection is validated on the same hardware lane that motivated the tuning
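For reference, tuned fused-MoE configs are keyed by a filename convention of roughly this shape. The moe_config_filename helper below is a hedged sketch, not the real lookup function, and the device-name strings are assumptions; the actual helper and device-name normalization live in SGLang's fused-MoE tuning code.

```python
# Hedged sketch of the tuned fused-MoE config filename convention; the real
# lookup helper and device-name normalization live in SGLang's fused-MoE code.
def moe_config_filename(E: int, N: int, device_name: str, dtype: str) -> str:
    return f"E={E},N={N},device_name={device_name},dtype={dtype}.json"

# Expected shape for the configs this stage adds (device string assumed):
print(moe_config_filename(257, 256, "NVIDIA_H20", "fp8_w8a8"))
# E=257,N=256,device_name=NVIDIA_H20,dtype=fp8_w8a8.json
```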
Validation Surface
Use the narrowest lane that matches the change:
- parser-only: open #22433 test pattern, or equivalent DeepSeekV31Detector CPU unit tests
- template-only: test/manual/test_deepseek_chat_templates.py
- V3.1 base/MTP: test/manual/test_deepseek_v31.py
- nightly performance: test/manual/nightly/test_deepseek_v31_perf.py
- inherited MLA/MoE/quant bug: switch to sglang-deepseek-v3-r1-optimization