# Auto Recipe — Recipe Index & Recommendation
This skill indexes every shipped recipe and helps users pick the right starting
config, adjust parallelism, and avoid common pitfalls.
## How to Use This Skill
- Ask the user for: model name/size, GPU count & type, training goal
(pretrain / SFT / PEFT), and sequence length (if non-default).
- Look up the best-match recipe in the index below.
- Recommend the recipe function name + entry-point command.
- Provide adjustment advice (parallelism resizing, batch tuning, pitfalls).
## Entry Points

### Library recipes (functional training)

```bash
# Pretrain with mock data (8 GPUs on one node)
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-pretrain-mock

# SFT / fine-tuning dataset
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-finetune

# Dotted-path overrides are appended as extra arguments
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.tensor_model_parallel_size=2' \
    'training.global_batch_size=64'
```
### Performance recipes (throughput benchmarks)

```bash
# Throughput benchmark on 64 H100s with mock data
python scripts/performance/run_script.py \
    --recipe <model_family> \
    --gpu_type h100 \
    --num_gpus 64 \
    --data mock
```
**Note:** Perf recipes are NOT fully validated for correctness; most testing has
been done on mock data. They are designed for upper-bound throughput measurement,
not production training. Always validate loss curves and convergence independently.
## Recipe Unification (Coming Soon — PR #2803)

PR #2803 is unifying performance recipes into the same Python function format
used by library recipes. Key changes:

- Perf recipes move from `scripts/performance/configs/` →
  `src/megatron/bridge/recipes/<family>/<model>_perf.py`
- Each perf recipe becomes a self-contained Python function (e.g.
  `llama3_8b_h100_bf16_pretrain_config()`)
- The old `WorkloadBaseConfig → set_workload_base_configs → get_perf_optimized_recipe`
  pipeline is removed
- Shared helpers: `_benchmark_common()` (50 iters, timing, TE RNG) and
  `_perf_precision()` (bf16 / fp8_cs / fp8_mx / nvfp4)
Why Python, not YAML? Previous YAML-based approaches had problems:
recipe logic was split across multiple indirection layers, configs were not
self-contained, and the two-level pipeline made maintenance and debugging
difficult. Python functions are explicit, greppable, and composable.
After #2803 lands, both library and perf recipes will be invocable through the
same run_recipe.py entry point.
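For orientation, here is a rough sketch of what a unified perf recipe might look
like after #2803. Only the function and helper names above come from the PR
summary; the arguments, config fields, and body below are illustrative
assumptions, not the actual implementation.

```python
# Hypothetical sketch only -- not the real Bridge code. It illustrates why a
# plain Python function is easier to grep and compose than the YAML +
# indirection pipeline: everything the benchmark does is visible in one place.
def llama3_8b_h100_bf16_pretrain_config():
    # Start from the corresponding library recipe (assumed import).
    cfg = llama3_8b_pretrain_config()

    # Shared benchmark setup: ~50 iterations, timing, TE RNG tracker.
    _benchmark_common(cfg)

    # Precision preset: "bf16", "fp8_cs", "fp8_mx", or "nvfp4".
    _perf_precision(cfg, "bf16")

    # Model/hardware-specific tuning lives right here, greppable in one file.
    cfg.model.tensor_model_parallel_size = 1
    cfg.training.micro_batch_size = 1
    return cfg
```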
## Library Recipe Index

All recipes live under `src/megatron/bridge/recipes/`. Each function returns a
`ConfigContainer` with model, training, optimizer, and data settings.
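Recipes can also be consumed directly from Python instead of through
`run_recipe.py`. A minimal sketch, assuming the import path follows the directory
layout above and the attribute names mirror the CLI overrides shown earlier
(both are assumptions, not verified API):

```python
# Sketch: build a config from a recipe function, then override fields in Python.
# Import path and attribute names are assumed from this document's examples.
from megatron.bridge.recipes.llama import llama3_8b_pretrain_config

cfg = llama3_8b_pretrain_config()          # returns a ConfigContainer
cfg.model.tensor_model_parallel_size = 2   # same knob as 'model.tensor_model_parallel_size=2'
cfg.training.global_batch_size = 64        # same knob as 'training.global_batch_size=64'
```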
### Llama

| Recipe | Mode | TP | PP | CP | SP | GPUs (min) | Seq Len |
|---|---|---|---|---|---|---|---|
| `llama2_7b_pretrain_config` | Pretrain | 2 | 1 | — | — | 2 | 4K |
| `llama3_8b_pretrain_config` | Pretrain | 2 | 1 | — | ✓ | 2 | 8K |
| `llama3_8b_16k_pretrain_config` | Pretrain | 2 | 1 | 2 | ✓ | 4 | 16K |
| `llama3_8b_64k_pretrain_config` | Pretrain | 2 | 1 | 4 | ✓ | 8 | 64K |
| `llama3_8b_128k_pretrain_config` | Pretrain | 2 | 1 | 8 | ✓ | 16 | 128K |
| `llama3_70b_pretrain_config` | Pretrain | 8 | 4 | — | ✓ | 32 | 8K |
| `llama3_70b_16k_pretrain_config` | Pretrain | 8 | 4 | 2 | ✓ | 64 | 16K |
| `llama3_70b_64k_pretrain_config` | Pretrain | 8 | 4 | 4 | ✓ | 128 | 64K |
| `llama31_405b_pretrain_config` | Pretrain | 8 | 16 | — | ✓ | 128 | 8K |
| `llama3_8b_sft_config` | SFT | 2 | 1 | — | ✓ | 2 | 8K |
| `llama3_70b_sft_config` | SFT | 4 | 4 | — | ✓ | 16 | 8K |
| `llama31_405b_sft_config` | SFT | 8 | 8 | — | ✓ | 64 | 8K |
| `llama3_8b_peft_config` | PEFT | 1 | 1 | — | — | 1 | 8K |
| `llama3_70b_peft_config` | PEFT | 2 | 4 | — | ✓ | 8 | 8K |
| `llama31_405b_peft_config` | PEFT | 4 | 8 | — | ✓ | 32 | 8K |
### Qwen2 / Qwen2.5

| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| `qwen2_*_{pretrain,sft,peft}_config` | All | 1–8 | 1–4 | 500M, 1.5B, 7B, 14B, 32B, 72B |
| `qwen25_*_{pretrain,sft,peft}_config` | All | 1–8 | 1–4 | 500M, 1.5B, 3B, 7B, 14B, 32B, 72B |
### Qwen3 (Dense)

| Recipe | Mode | TP | PP | CP | Sizes |
|---|---|---|---|---|---|
| `qwen3_*_pretrain_config` | Pretrain | 1–8 | 1–2 | — | 600M–32B |
| `qwen3_*_sft_config` | SFT | 1–8 | 1–2 | — | 600M–32B |
| `qwen3_600m_sft_128k_config` | SFT | 1 | 1 | 8 | 600M (128K seq) |
| `qwen3_*_peft_config` | PEFT | 1 | 1 | — | 600M–32B |
### Qwen3 MoE

| Recipe | Mode | TP | PP | EP | CP | GPUs |
|---|---|---|---|---|---|---|
| `qwen3_30b_a3b_pretrain_config` | Pretrain | 1 | 1 | 8 | — | 8 |
| `qwen3_30b_a3b_sft_config` | SFT | 1 | 1 | 8 | — | 8 |
| `qwen3_30b_a3b_peft_config` | PEFT | 1 | 1 | 1 | — | 1 |
| `qwen3_235b_a22b_pretrain_config` | Pretrain | 4 | 16 | 8 | 2 | 512+ |
| `qwen3_235b_a22b_sft_config` | SFT | 4 | 8 | 8 | — | 256 |
| `qwen3_235b_a22b_peft_config` | PEFT | 1 | 4 | 4 | — | 16 |
### Qwen3-Next

| Recipe | Mode | TP | PP | EP |
|---|---|---|---|---|
| `qwen3_next_80b_a3b_pretrain_config` | Pretrain | 1 | 4 | 8 |
| `qwen3_next_80b_a3b_sft_config` | SFT | 1 | 2 | 8 |
| `qwen3_next_80b_a3b_peft_config` | PEFT | 1 | 1 | 4 |
### DeepSeek

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `deepseek_v2_lite_pretrain_config` | Pretrain | 1 | 1 | 8 | 8 |
| `deepseek_v2_pretrain_config` | Pretrain | 1 | 4 | 32 | 128 |
| `deepseek_v3_pretrain_config` | Pretrain | 2 | 16 | 64 | 2048 |
| `deepseek_v3_pretrain_config_32nodes` | Pretrain | 2 | 8 | 32 | 256 |
### GLM-4.5

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `glm45_355b_pretrain_config` | Pretrain | 2 | 8 | 16 | 256 |
| `glm45_air_106b_pretrain_config` | Pretrain | 1 | 4 | 8 | 32 |
| `glm45_355b_sft_config` | SFT | 2 | 8 | 16 | 256 |
| `glm45_air_106b_sft_config` | SFT | 1 | 4 | 8 | 32 |
| `glm45_355b_peft_config` | PEFT | 2 | 4 | 4 | 32 |
| `glm45_air_106b_peft_config` | PEFT | 1 | 2 | 4 | 8 |
### Gemma

| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| `gemma2_*_{pretrain,sft,peft}_config` | All | 2–8 | 1–2 | 2B, 9B, 27B |
| `gemma3_1b_{pretrain,sft,peft}_config` | All | 1 | 1 | 1B (32K seq) |
### NemotronH / Nemotron

| Recipe | Mode | TP | PP | EP | Notes |
|---|---|---|---|---|---|
| `nemotronh_{4b,8b,47b,56b}_*_config` | P/S/PEFT | 1–8 | 1–4 | — | Dense SSM-hybrid |
| `nemotron_3_nano_*_config` | P/S/PEFT | varies | 1 | 8 | MoE + Mamba |
| `nemotron_3_super_*_config` | P/S/PEFT | 4 | 1 | 8 | MoE + Mamba, ~40% CUDA graph gain |
| `nemotron_nano_{9b,12b}_v2_*_config` | P/S/PEFT | varies | 1 | — | Dense |
### Other Models

| Recipe | Mode | Notes |
|---|---|---|
| `moonlight_16b_{pretrain,sft,peft}_config` | All | MoE, EP=8 |
| `olmoe_7b_{pretrain,sft,peft}_config` | All | MoE, EP=8 |
| `ministral3_{3b,8b,14b}_{sft,peft}_config` | SFT/PEFT | Dense |
| `gpt_oss_20b_*_config` | All | MoE + FP8/MXFP8 variants |
| `gpt_oss_120b_*_config` | All | MoE |
| `vanilla_gpt_pretrain_config` | Pretrain | MLM/Bridge parity baseline |
| `gpt3_175b_pretrain_config` | Pretrain | TP=4, PP=8, VP=6 |
| `kimi_k2_pretrain_config` | Pretrain | 1T MoE, TP=2 PP=16 EP=32 |
### VLM Recipes

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `gemma3_vl_{4b,12b,27b}_{sft,peft}_config` | SFT/PEFT | 1–8 | 1–2 | — | 1–16 |
| `qwen25_vl_{3b,7b,32b,72b}_{sft,peft}_config` | SFT/PEFT | 1–8 | 1–4 | — | 1–32 |
| `qwen3_vl_{8b,30b_a3b,235b_a22b}_{sft,peft}_config` | SFT/PEFT | 1–4 | 1–8 | 1–32 | 1–512 |
| `qwen35_vl_*_{sft,peft}_config` | SFT/PEFT | varies | varies | varies | varies |
| `glm_45v_{sft,peft}_config` | SFT/PEFT | 1 | 8 | 4–16 | 64–512 |
| `nemotron_nano_v2_vl_12b_{sft,peft}_config` | SFT/PEFT | 2–4 | 1 | — | 8 |
### Diffusion Recipes

| Recipe | Mode | TP | CP |
|---|---|---|---|
| `wan_1_3B_{pretrain,sft}_config` | P/SFT | 1 | 8 |
| `wan_14B_{pretrain,sft}_config` | P/SFT | 2 | 4 |
| `flux_12b_{pretrain,sft}_config` | P/SFT | 2 | 1 |
## Performance Recipe Index

All perf recipes live under `scripts/performance/`. They are invoked via
`run_script.py` and use `WorkloadBaseConfig` presets per GPU type.

**Important:** Perf recipes are designed for upper-bound throughput benchmarks,
not production training. They run 50 iterations on mock data by default.
Throughput numbers are aspirational targets, not validated convergence configs.
### Llama 3 / 3.1

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Llama 3 8B | 8 | H100, B200, B300, GB200, GB300, R100 | CUDA graphs (local), FSDP on GB variants |
| Llama 3 70B | 64 | H100, B200, B300, GB200, GB300 | TP comm overlap (userbuffers), FSDP, CUDA graphs |
| Llama 3.1 405B | 128–1024 | H100, B200, B300, GB200, GB300 | TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP |

SFT/LoRA variants also exist (e.g. 8B SFT with packed sequences, 70B SFT on 32 GPUs).
### DeepSeek V3

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| DeepSeek V3 (671B MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped) |
### Qwen3 MoE

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | MoE alltoall/flex dispatcher |
| Qwen3 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | TP comm overlap, CUDA graphs, MoE a2a overlap |
| Qwen3-Next 80B-A3B | 64–128 | H100, B200, B300, GB200, GB300 | EP 64–128 |
### Qwen3-VL

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3-VL 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | VLM + MoE |
| Qwen3-VL 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | VLM + MoE, TP comm overlap |
### Kimi K2

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Kimi K2 (1T MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | Muon/Adam optimizer, HybridEP, pipeline layout helpers |
### NemotronH

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Nemotron 3 Nano (30B MoE+Mamba) | 8–16 | H100, B200, B300, GB200, GB300 | TE CUDA graphs (attn+mamba+moe), HybridEP |
| Nemotron 3 Super | 64 | H100, B200, B300, GB200, GB300 | TE CUDA graphs, EP=64 |
| NemotronH 56B | 64 | H100, B200, B300 | TP=2–8, TE graphs (mamba+attn) |
### GPT-OSS

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| GPT-OSS 120B | 64 | H100, B200, GB200 | EP=64, HybridEP on GB200 |
## Recommendation Decision Tree

```text
User wants to train a model
│
├─ Know the model name?
│   ├─ Yes → Look up in Library Recipe Index above
│   │   ├─ Has a recipe for their size + mode? → Use it directly
│   │   └─ No exact match? → Use closest size, adjust parallelism
│   └─ No → Ask for model name, size, and HF model ID
│
├─ What's the training goal?
│   ├─ Pretrain → Use *_pretrain_config
│   ├─ SFT (full fine-tune) → Use *_sft_config
│   └─ PEFT (LoRA/DoRA) → Use *_peft_config (lowest GPU requirement)
│
├─ How many GPUs?
│   ├─ 1 GPU → Only PEFT recipes work (TP=1, PP=1)
│   ├─ 8 GPUs (1 node) → Most 8B–16B models, small MoE (EP=8)
│   ├─ 16–64 GPUs → 70B dense, medium MoE
│   └─ 128+ GPUs → 405B+, large MoE (DeepSeek V3, Kimi K2)
│
├─ Want throughput benchmarks?
│   ├─ Yes → Use perf recipes (scripts/performance/)
│   │   └─ ⚠️ These run on mock data for upper-bound perf only
│   └─ No → Use library recipes (scripts/training/run_recipe.py)
│
└─ Long context?
    ├─ > 8K → Need CP (context parallelism), check *_16k / *_64k / *_128k variants
    └─ ≤ 8K → Default recipes work
```
## Adjustment Advice (When Recommending)

### Parallelism Resizing Rules

When the user's GPU count differs from the recipe default:

- TP must divide `num_key_value_heads` (GQA constraint). E.g. if
  `num_key_value_heads=8`, valid TP = {1, 2, 4, 8}.
- TP should stay within a single node (NVLink). TP > 8 requires inter-node
  NVLink (e.g., GB200 NVL72).
- PP adds pipeline bubbles. Minimize PP; only increase it when TP alone can't
  fit the model. Use VP (virtual pipeline) to mitigate bubble overhead.
- EP doesn't reduce dense-layer memory. Only expert parameters shard with EP;
  shared attention/embeddings are replicated. For "OOM with MoE", increase EP
  first, not TP.
- SP should be True whenever TP > 1. It eliminates redundant activation copies
  and is essentially free.
- CP requires all-to-all or ring attention. Check `cp_comm_type`. For GQA
  models, a2a+p2p hierarchical CP allows CP > num_kv_heads.
- world_size = DP × TP × PP × CP × EP. DP is implicit. Make sure the product of
  explicit parallelisms divides your total GPU count, as in the sketch below.
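A quick self-contained check for these rules (plain Python, no Bridge
dependency; the GQA and divisibility constraints follow the bullets above):

```python
def check_parallelism(world_size, tp, pp, cp=1, ep=1, num_kv_heads=8):
    """Validate a parallelism layout against the rules above and return DP."""
    if num_kv_heads % tp != 0:
        raise ValueError(f"TP={tp} must divide num_key_value_heads={num_kv_heads}")
    model_parallel = tp * pp * cp * ep
    if world_size % model_parallel != 0:
        raise ValueError(
            f"TP*PP*CP*EP={model_parallel} must divide world_size={world_size}")
    return world_size // model_parallel  # DP is whatever is left over

# Example: llama3_70b_pretrain_config defaults (TP=8, PP=4) on 64 GPUs -> DP=2
print(check_parallelism(64, tp=8, pp=4))
```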
### Batch Size Tuning

- Start with the recipe's `micro_batch_size`. If OOM, reduce it to 1.
- `global_batch_size` determines learning dynamics. Scale with DP:
  GBS = micro_batch_size × DP × gradient_accumulation_steps.
- For MoE, `micro_batch_size=1` is typical at scale.
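The same arithmetic as a short sketch (plain Python; it just restates the GBS
formula above):

```python
def grad_accum_steps(global_batch_size, micro_batch_size, dp):
    """GBS = micro_batch_size * DP * grad_accum, so grad_accum is derived."""
    per_step = micro_batch_size * dp
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} must be divisible by "
            f"micro_batch_size * DP = {per_step}")
    return global_batch_size // per_step

# Example: GBS=64 with micro_batch_size=1 and DP=8 -> 8 accumulation steps
print(grad_accum_steps(64, 1, 8))
```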
### Common Pitfalls to Warn About

| Pitfall | Symptom | Fix |
|---|---|---|
| TP > num_kv_heads | Crash: "TP must divide num_query_groups" | Reduce TP to a divisor of num_kv_heads |
| PP without VP | Poor throughput (large pipeline bubble) | Set `virtual_pipeline_model_parallel_size` |
| EP too low for large MoE | OOM on expert params | Increase EP; each rank then holds num_experts/EP experts |
| CUDA graphs + packed sequences | Assert: "CUDA graph accepts only Tensor inputs" | Disable packing or use local full-iteration graphs |
| CUDA graphs + full recompute | Assert: "full recompute only with full iteration CUDA graph" | Disable recompute or switch to the local impl |
| `use_te_rng_tracker` not set | Assert on provider init when CUDA graphs enabled | Set `cfg.model.use_te_rng_tracker = True` and `cfg.rng.te_rng_tracker = True` |
| FSDP + TP > 1 on H100 | Possible comm bottleneck | Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series |
| Long context without CP | OOM on activations | Add CP=2/4/8; use the `*_16k`, `*_64k`, or `*_128k` recipe variants |
| MoE `overlap_grad_reduce` on H100 | May hurt perf (set to False in many H100 presets) | Set `overlap_grad_reduce=False` for MoE on H100 |
| VLM SFT missing image data | Runs but produces garbage | Provide an actual multimodal dataset or use mock VLM data |
| Qwen35-VL MoE FSDP | Tested on Blackwell only | May not work on H100; validate first |
## Recipe Override Examples

```bash
# Baseline: Llama 3 8B pretrain on mock data
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock

# SFT a small MoE on 4 GPUs, shrinking EP to match the GPU count
uv run torchrun --nproc_per_node=4 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_sft_config \
    --dataset llm-finetune \
    'model.expert_model_parallel_size=4'

# Longer context: raise seq_length and add context parallelism
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.seq_length=32768' \
    'model.context_parallel_size=4'

# Scoped TE CUDA graphs for a MoE (requires the TE RNG tracker; see pitfalls)
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.cuda_graph_impl=transformer_engine' \
    'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
    'model.use_te_rng_tracker=True' \
    'rng.te_rng_tracker=True'
```
## Quick Reference: Which Recipe for My Situation?

| I want to... | Start with | GPUs needed |
|---|---|---|
| Try Bridge for the first time | `llama3_8b_sft_config` + mock data | 2 |
| Fine-tune a 7–8B model | `llama3_8b_sft_config` or `qwen3_8b_sft_config` | 2–8 |
| LoRA on 1 GPU | `llama3_8b_peft_config` or `qwen3_8b_peft_config` | 1 |
| Pretrain a dense 70B | `llama3_70b_pretrain_config` | 32–64 |
| Train a small MoE | `qwen3_30b_a3b_pretrain_config` | 8 |
| Train a large MoE (235B+) | `qwen3_235b_a22b_pretrain_config` | 256–512 |
| Benchmark throughput | Perf recipes via `run_script.py` | Varies |
| Long-context training | `llama3_8b_128k_pretrain_config` or add CP override | 16+ |
| VLM fine-tuning | `qwen3_vl_8b_sft_config` or `gemma3_vl_*_sft_config` | 4–8 |
| Diffusion training | `wan_1_3B_pretrain_config` or `flux_12b_pretrain_config` | 8 |
## Code Anchors

| What | Path |
|---|---|
| Library recipes root | `src/megatron/bridge/recipes/` |
| Recipe `__init__.py` (all exports) | `src/megatron/bridge/recipes/__init__.py` |
| Common recipe helpers | `src/megatron/bridge/recipes/common.py` |
| Training entry point | `scripts/training/run_recipe.py` |
| Perf recipes root | `scripts/performance/` |
| Perf entry point | `scripts/performance/run_script.py` |
| Perf workload configs | `scripts/performance/configs/<family>/` |
| Perf overrides (benchmark defaults) | `scripts/performance/utils/overrides.py` |