SLIME User Guide
SLIME is an LLM post-training framework for RL Scaling developed by THUDM. It supports various RL algorithms (GRPO, GSPO, PPO, Reinforce++), multiple training backends (Megatron, FSDP), and advanced features like multi-turn interactions, tool calling, and dynamic sampling.
Quick Start Workflow
For First-Time Users
- Environment Setup
  - Use Docker: `docker pull slimerl/slime:latest`
  - Or build from source: see `docs/en/get_started/quick_start.md`
  - Hardware: supports H100/H200 and B200 series
- Download Model and Data

  ```bash
  hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
  hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
  ```

- Convert Weights (Megatron backend only)

  ```bash
  source scripts/models/qwen3-4B.sh
  PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
      ${MODEL_ARGS[@]} \
      --hf-checkpoint /root/Qwen3-4B \
      --save /root/Qwen3-4B_torch_dist
  ```

- Run Training

  ```bash
  bash scripts/run-qwen3-4B.sh
  ```
For Experienced Users
When you need specific functionality:
- Multi-turn/tool calling: read the Search-R1 section of `references/examples_reference.md`
- Custom reward models: see the custom RM pattern in the examples reference
- FSDP instead of Megatron: use `--train-backend fsdp` and skip weight conversion
- Large-scale training: see the multi-node examples (GLM-4.5, DeepSeek-R1)
- Source code exploration: Check references/source_code_reference.md
Documentation Navigation
SLIME has extensive documentation. Use this guide to find what you need quickly.
Essential Documentation (Read These First)
- Quick Start Guide: `docs/en/get_started/quick_start.md` (setup and first training run)
- Usage Guide: `docs/en/get_started/usage.md` (comprehensive parameter reference)
- Example Docs: `docs/en/examples/qwen3-4B.md` or `docs/en/examples/glm4-9B.md`
For detailed navigation of all documentation, see references/doc_navigation.md.
Common Tasks → Documentation Mapping
| Task | Documentation |
|---|---|
| First-time setup | docs/en/get_started/quick_start.md |
| Understanding parameters | docs/en/get_started/usage.md |
| Basic training (8 GPUs) | docs/en/examples/qwen3-4B.md |
| Multi-turn tool use | examples/search-r1/ |
| Custom generation logic | docs/en/get_started/customization.md |
| Multi-node training | docs/en/examples/glm4.5-355B-A32B.md |
| FSDP backend | docs/en/get_started/usage.md (FSDP section) |
| VLM training | examples/geo3k_vlm/ |
| Troubleshooting | docs/en/get_started/qa.md |
Core Concepts
Training Loop
SLIME uses a "Rollout → Train" loop:
- Rollout: Generate responses using SGLang inference
- Reward: Compute rewards using reward model
- Train: Update model weights using Megatron/FSDP
- Repeat for `--num-rollout` iterations
Key Constraint
rollout-batch-size × n-samples-per-prompt = global-batch-size × num-steps-per-rollout
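This constraint is easy to check before launching a job. A minimal standalone helper (not part of SLIME) that verifies a configuration:

```python
def batch_sizes_consistent(rollout_batch_size: int,
                           n_samples_per_prompt: int,
                           global_batch_size: int,
                           num_steps_per_rollout: int = 1) -> bool:
    # Samples produced per rollout must equal samples consumed
    # by the optimizer across its steps within that rollout.
    produced = rollout_batch_size * n_samples_per_prompt
    consumed = global_batch_size * num_steps_per_rollout
    return produced == consumed

# e.g. 32 prompts x 8 samples = 256 = global batch 256 x 1 step
```

Running this check against your script's argument values catches the "mismatched batch sizes" error described under Troubleshooting before Ray ever starts.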
Resource Allocation Modes
Colocated (training and inference share GPUs):
```bash
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--colocate \
--sglang-mem-fraction-static 0.7
```
Disaggregated (separate GPUs for training/inference):
```bash
--actor-num-nodes 1 \
--actor-num-gpus-per-node 4 \
--rollout-num-gpus 4
```
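As a sanity check on how the two modes partition hardware, here is a hypothetical accounting helper (for illustration only, not a SLIME API):

```python
def gpus_required(actor_num_nodes: int, actor_num_gpus_per_node: int,
                  rollout_num_gpus: int = 0, colocate: bool = False) -> int:
    # Colocated: SGLang engines share the actor GPUs, so rollout adds nothing.
    # Disaggregated: rollout engines occupy their own GPUs on top of the actors'.
    training = actor_num_nodes * actor_num_gpus_per_node
    return training if colocate else training + rollout_num_gpus

# Both examples above occupy 8 GPUs total:
#   colocated:     gpus_required(1, 8, colocate=True)
#   disaggregated: gpus_required(1, 4, rollout_num_gpus=4)
```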
Parameter Quick Reference
Essential Parameters
Model Loading:
- `--hf-checkpoint`: HuggingFace model path (for SGLang and FSDP)
- `--ref-load`: Megatron reference model checkpoint
- `--load`: Megatron actor checkpoint (resume training)
- `--save`: Save path for checkpoints

Data:
- `--prompt-data`: JSONL dataset path
- `--input-key`: Field name for prompts (default: "prompt")
- `--label-key`: Field name for labels (default: "label")
- `--metadata-key`: Field name for metadata (default: "metadata")
- `--apply-chat-template`: Apply tokenizer chat template

Rollout:
- `--rollout-batch-size`: Prompts per rollout
- `--n-samples-per-prompt`: Responses per prompt
- `--rollout-max-response-len`: Max response length
- `--rollout-temperature`: Sampling temperature

Training:
- `--num-rollout`: Total training iterations
- `--num-steps-per-rollout`: Optimizer steps per rollout (default: 1)
- `--global-batch-size`: Samples per optimizer step
- `--advantage-estimator`: RL algorithm (grpo, gspo, ppo, reinforce_plus_plus)

Reward Model:
- `--rm-type`: Built-in RM type (e.g., "deepscaler")
- `--custom-rm-path`: Custom RM function path

Backends:
- `--train-backend`: Training backend (megatron or fsdp)
- `--rollout-num-gpus-per-engine`: GPUs per SGLang engine (analogous to tp_size)
For complete parameter reference, see docs/en/get_started/usage.md.
Common Workflows
1. Standard Single-Turn Training
Use example scripts as templates:
- `scripts/run-qwen3-4B.sh`: Basic 8xH100 setup
- `scripts/run-glm4-9B.sh`: With dynamic sampling
Key sections in script:
```bash
# Load model config
source scripts/models/qwen3-4B.sh

# Configure checkpoints
CKPT_ARGS=(--hf-checkpoint /root/Qwen3-4B ...)

# Configure rollout
ROLLOUT_ARGS=(
    --rollout-batch-size 32
    --n-samples-per-prompt 8
    --rm-type deepscaler
)

# Configure algorithm
GRPO_ARGS=(--advantage-estimator grpo ...)

# Run training
ray job submit ... -- python3 train.py \
    ${MODEL_ARGS[@]} ${CKPT_ARGS[@]} ${ROLLOUT_ARGS[@]} ...
```
2. Multi-Turn Tool Calling
For multi-turn scenarios (like Search-R1):
- Prepare Data with metadata:

  ```json
  {
    "question": "User query",
    "final_answer": "Expected answer",
    "metadata": "{\"session_id\": \"123\", \"tool_code\": \"...\"}"
  }
  ```

- Implement Custom Generation Function:

  ```python
  async def generate(args, sample: Sample, sampling_params) -> Sample:
      for turn in range(max_turns):
          # Generate action
          model_output = await call_sglang(...)
          sample.loss_mask += [1] * len(model_tokens)  # Train on actions

          # Execute tool
          tool_output = await execute_tool(...)
          sample.loss_mask += [0] * len(tool_tokens)  # Mask tool outputs

          if action == "answer":
              break

      sample.tokens = prompt_tokens + response_tokens
      sample.response_length = len(response_tokens)
      return sample
  ```

- Configure Custom Functions:

  ```bash
  --custom-generate-function-path my_module.generate \
  --custom-rm-path my_module.reward_func \
  --metadata-key metadata
  ```
See examples/search-r1/ for a complete example.
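The loss-mask bookkeeping above follows one invariant: model-generated tokens get mask 1 (trained on) and tool-output tokens get mask 0 (ignored by the loss), with the mask always the same length as the token list. A minimal illustration with made-up token ids:

```python
tokens: list[int] = []
loss_mask: list[int] = []

def append_segment(ids: list[int], from_model: bool) -> None:
    # Model tokens contribute to the loss (mask 1); tool outputs do not (mask 0).
    tokens.extend(ids)
    loss_mask.extend([1 if from_model else 0] * len(ids))

append_segment([101, 102, 103], from_model=True)   # model action
append_segment([201, 202], from_model=False)       # tool result
append_segment([104], from_model=True)             # final answer

assert loss_mask == [1, 1, 1, 0, 0, 1]
assert len(tokens) == len(loss_mask)  # masks must stay aligned with tokens
```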
3. Dynamic Sampling (DAPO-style)
Filter low-quality samples during generation:
```bash
ROLLOUT_ARGS+=(
    --over-sampling-batch-size 64 \
    --rollout-batch-size 32 \
    --dynamic-sampling-filter-path \
        slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
)
```
How it works:
- Samples 64 prompts (over-sampling)
- Filters groups based on reward diversity
- Keeps only 32 prompts × 8 samples that pass filter
- Automatically resamples if too many filtered out
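The bundled filter can be approximated as follows. This is a hypothetical reimplementation for illustration (the real one lives under slime/rollout/filter_hub/); `SimpleNamespace` stands in for SLIME's `Sample` type:

```python
import statistics
from types import SimpleNamespace

def check_reward_nonzero_std(args, samples, **kwargs) -> bool:
    # Keep a prompt group only if its rewards vary: a group where every
    # response earned the same reward yields zero advantage under GRPO,
    # so it contributes no learning signal.
    rewards = [sample.reward for sample in samples]
    return statistics.pstdev(rewards) > 0.0

mixed = [SimpleNamespace(reward=r) for r in (1.0, 0.0, 1.0)]
uniform = [SimpleNamespace(reward=1.0) for _ in range(3)]
```

A group with mixed rewards passes the filter; a group where all 8 samples scored identically (all correct or all wrong) is discarded and resampled.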
4. FSDP Backend (No Weight Conversion)
```bash
--train-backend fsdp \
--hf-checkpoint /root/Qwen3-4B \
--gradient-checkpointing \
--context-parallel-size 2
```
Benefits:
- No HF → Megatron weight conversion needed
- Directly load HuggingFace checkpoints
- Simpler setup for supported models
See examples/geo3k_vlm/ and docs/en/get_started/usage.md FSDP section.
5. Multi-Node Training
- Start Ray cluster:

  ```bash
  # Head node
  ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8

  # Worker nodes
  ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
  ```

- Submit job:

  ```bash
  ray job submit --address="http://127.0.0.1:8265" \
      --runtime-env-json='{"env_vars": {"PYTHONPATH": "/root/Megatron-LM/"}}' \
      -- python3 train.py \
      --actor-num-nodes 8 \
      --actor-num-gpus-per-node 8 \
      ...
  ```
See docs/en/examples/glm4.5-355B-A32B.md for large-scale example.
Customization Guide
Custom Reward Model
Implement async function:
```python
async def my_reward_func(args, sample: Sample, **kwargs) -> float:
    # Access sample fields
    prompt = sample.prompt
    response = sample.response
    label = sample.label

    # Compute reward
    reward = compute_score(response, label)
    return float(reward)
```
Use with: --custom-rm-path module.path:my_reward_func
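The `compute_score` helper above is not provided by SLIME; a minimal stand-in could be plain exact-match scoring (hypothetical; real tasks usually parse the answer out of the response first, and SLIME's built-in RMs such as "deepscaler" do their own scoring):

```python
def compute_score(response: str, label: str) -> float:
    # Reward 1.0 for an exact match after trimming whitespace and
    # normalizing case, 0.0 otherwise.
    return 1.0 if response.strip().lower() == label.strip().lower() else 0.0
```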
Custom Generation Function
Implement async function:
```python
async def my_generate(args, sample: Sample, sampling_params) -> Sample:
    # Load tokenizer
    from slime.utils.processing_utils import load_tokenizer
    tokenizer = load_tokenizer(args.hf_checkpoint, trust_remote_code=True)

    # Generate response (call SGLang API or custom logic)
    from slime.utils.http_utils import post
    output = await post(
        f"http://{args.sglang_router_ip}:{args.sglang_router_port}/generate",
        {"text": sample.prompt, "sampling_params": sampling_params},
    )

    # Set sample fields
    prompt_tokens = tokenizer(sample.prompt, add_special_tokens=False)["input_ids"]
    response_tokens = tokenizer(output["text"], add_special_tokens=False)["input_ids"]
    sample.tokens = prompt_tokens + response_tokens
    sample.response_length = len(response_tokens)
    sample.response = output["text"]
    sample.truncated = output["meta_info"]["finish_reason"]["type"] == "length"
    return sample
```
Use with: --custom-generate-function-path module.path:my_generate
Custom Dynamic Filter
Implement filter function:
```python
def my_filter(args, samples: list[Sample], **kwargs) -> bool:
    # Return True to keep samples, False to discard
    return all(sample.reward > 0.5 for sample in samples)
```
Use with: --dynamic-sampling-filter-path module.path:my_filter
Examples Reference
For detailed examples and patterns, see references/examples_reference.md.
Quick finder:
- Basic math training: `scripts/run-qwen3-4B.sh`
- Multi-turn tool use: `examples/search-r1/`
- Vision-language RL: `examples/geo3k_vlm/`
- Large-scale MoE: `docs/en/examples/glm4.5-355B-A32B.md`
- Custom generation: `examples/search-r1/search_r1_logic.py`
- FSDP backend: `examples/geo3k_vlm/`
Source Code Reference
For source code exploration, see references/source_code_reference.md.
Key files:
- Arguments: `slime/utils/arguments.py`
- Rollout: `slime/rollout/sglang_rollout.py`
- Sample type: `slime/utils/types.py`
- Reward models: `slime/rollout/rm_hub/`
- Conversion tools: `tools/convert_hf_to_torch_dist.py`
Troubleshooting
Common Issues
OOM during colocated training:
- Reduce `--sglang-mem-fraction-static` (try 0.7 or 0.6)
- Reduce `--max-tokens-per-gpu`
- Enable gradient checkpointing: `--recompute-granularity full`
Mismatched batch sizes:
- Ensure: rollout-batch-size × n-samples-per-prompt = global-batch-size × num-steps-per-rollout
Weight conversion errors:
- Check that the model config matches exactly (e.g., `--rotary-base`)
- Use the FSDP backend to skip conversion: `--train-backend fsdp`
Multi-node communication issues:
- Set environment variables: `GLOO_SOCKET_IFNAME`, `NCCL_SOCKET_IFNAME`
- See the multi-node section of `docs/en/get_started/quick_start.md`
SGLang concurrency issues:
- Limit concurrency: `--sglang-server-concurrency 160`
- Extend the CUDA graph batch sizes: `--sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)`
For more troubleshooting, see docs/en/get_started/qa.md.
Additional Resources
Reference Files
- Doc Navigation: references/doc_navigation.md - Find documentation quickly
- Examples Reference: references/examples_reference.md - Example scripts and patterns
- Source Code Reference: references/source_code_reference.md - Code structure and key functions
External Links
- GitHub Repository: https://github.com/THUDM/slime
- Docker Image: `slimerl/slime:latest`
- Megatron-LM: https://github.com/NVIDIA/Megatron-LM
- SGLang: https://github.com/sgl-project/sglang