recipe-recommender

Auto Recipe — Recipe Index & Recommendation

This skill indexes every shipped recipe and helps users pick the right starting config, adjust parallelism, and avoid common pitfalls.

How to Use This Skill

  1. Ask the user for: model name/size, GPU count & type, training goal (pretrain / SFT / PEFT), and sequence length (if non-default).
  2. Look up the best-match recipe in the index below.
  3. Recommend the recipe function name + entry-point command.
  4. Provide adjustment advice (parallelism resizing, batch tuning, pitfalls).

Entry Points

Library recipes (functional training)

# Pretrain with mock data
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-pretrain-mock

# SFT with SQuAD
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-finetune

# Override any field via CLI
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.tensor_model_parallel_size=2' \
    'training.global_batch_size=64'
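
The launch commands above all share one shape, so they can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_run_recipe_cmd` is a hypothetical helper name, and it only uses the flags and override syntax shown in the commands above.

```python
import shlex


def build_run_recipe_cmd(recipe, dataset, nproc_per_node=8, overrides=None):
    """Return the argv list for a run_recipe.py launch (hypothetical helper)."""
    cmd = [
        "uv", "run", "torchrun", f"--nproc_per_node={nproc_per_node}",
        "scripts/training/run_recipe.py",
        "--recipe", recipe,
        "--dataset", dataset,
    ]
    # Each override becomes one positional 'dotted.key=value' argument.
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_run_recipe_cmd(
    "llama3_8b_pretrain_config",
    "llm-pretrain-mock",
    overrides={
        "model.tensor_model_parallel_size": 2,
        "training.global_batch_size": 64,
    },
)
print(shlex.join(cmd))  # shell-safe string, ready to paste or log
```

Returning a list rather than a string keeps the result usable with `subprocess.run` without extra quoting.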

Performance recipes (throughput benchmarks)

python scripts/performance/run_script.py \
    --recipe <model_family> \
    --gpu_type h100 \
    --num_gpus 64 \
    --data mock

Perf recipes are NOT fully validated for correctness. Most conversations and testing were on mock data. They are designed for upper-bound throughput measurement, not production training. Always validate loss curves and convergence independently.


Recipe Unification (Coming Soon — PR #2803)

PR #2803 is unifying performance recipes into the same Python function format used by library recipes. Key changes:

  • Perf recipes move from scripts/performance/configs/ → src/megatron/bridge/recipes/<family>/<model>_perf.py
  • Each perf recipe becomes a self-contained Python function (e.g. llama3_8b_h100_bf16_pretrain_config())
  • The old WorkloadBaseConfig → set_workload_base_configs → get_perf_optimized_recipe pipeline is removed
  • Shared helpers: _benchmark_common() (50 iters, timing, TE RNG), _perf_precision() (bf16 / fp8_cs / fp8_mx / nvfp4)

Why Python, not YAML? Previous YAML-based approaches had problems: recipe logic was split across multiple indirection layers, configs were not self-contained, and the two-level pipeline made maintenance and debugging difficult. Python functions are explicit, greppable, and composable.

After #2803 lands, both library and perf recipes will be invocable through the same run_recipe.py entry point.


Library Recipe Index

All recipes live under src/megatron/bridge/recipes/. Each function returns a ConfigContainer with model, training, optimizer, and data settings.

Llama

| Recipe | Mode | TP | PP | CP | SP | GPUs (min) | Seq Len |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama2_7b_pretrain_config | Pretrain | 2 | 1 |  |  | 2 | 4K |
| llama3_8b_pretrain_config | Pretrain | 2 | 1 |  |  | 2 | 8K |
| llama3_8b_16k_pretrain_config | Pretrain | 2 | 1 | 2 |  | 4 | 16K |
| llama3_8b_64k_pretrain_config | Pretrain | 2 | 1 | 4 |  | 8 | 64K |
| llama3_8b_128k_pretrain_config | Pretrain | 2 | 1 | 8 |  | 16 | 128K |
| llama3_70b_pretrain_config | Pretrain | 8 | 4 |  |  | 32 | 8K |
| llama3_70b_16k_pretrain_config | Pretrain | 8 | 4 | 2 |  | 64 | 16K |
| llama3_70b_64k_pretrain_config | Pretrain | 8 | 4 | 4 |  | 128 | 64K |
| llama31_405b_pretrain_config | Pretrain | 8 | 16 |  |  | 128 | 8K |
| llama3_8b_sft_config | SFT | 2 | 1 |  |  | 2 | 8K |
| llama3_70b_sft_config | SFT | 4 | 4 |  |  | 16 | 8K |
| llama31_405b_sft_config | SFT | 8 | 8 |  |  | 64 | 8K |
| llama3_8b_peft_config | PEFT | 1 | 1 |  |  | 1 | 8K |
| llama3_70b_peft_config | PEFT | 2 | 4 |  |  | 8 | 8K |
| llama31_405b_peft_config | PEFT | 4 | 8 |  |  | 32 | 8K |
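
The lookup step can be sketched as a filter over the index. The example below copies a few Llama rows from the table above into a hypothetical `Recipe` tuple; treating an empty CP cell as CP=1 is an assumption, and `recipes_that_fit` is an illustrative name, not a Bridge API.

```python
from typing import NamedTuple


class Recipe(NamedTuple):
    name: str
    mode: str
    tp: int
    pp: int
    cp: int  # assumption: an empty CP cell in the table means CP=1
    min_gpus: int


# Subset of the Llama rows above; values copied from the table.
LLAMA_RECIPES = [
    Recipe("llama3_8b_pretrain_config", "pretrain", 2, 1, 1, 2),
    Recipe("llama3_8b_sft_config", "sft", 2, 1, 1, 2),
    Recipe("llama3_8b_peft_config", "peft", 1, 1, 1, 1),
    Recipe("llama3_70b_pretrain_config", "pretrain", 8, 4, 1, 32),
    Recipe("llama3_70b_sft_config", "sft", 4, 4, 1, 16),
    Recipe("llama3_70b_peft_config", "peft", 2, 4, 1, 8),
]


def recipes_that_fit(mode, num_gpus, recipes=LLAMA_RECIPES):
    """Names of recipes for `mode` whose minimum GPU count fits the budget."""
    return [r.name for r in recipes if r.mode == mode and r.min_gpus <= num_gpus]


print(recipes_that_fit("peft", 8))
print(recipes_that_fit("pretrain", 8))
```

For these rows the minimum GPU count equals TP × PP × CP, which is a quick sanity check when adding entries.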

Qwen2 / Qwen2.5

| Recipe | Mode | TP | PP | Sizes |
| --- | --- | --- | --- | --- |
| qwen2_*_{pretrain,sft,peft}_config | All | 1–8 | 1–4 | 500M, 1.5B, 7B, 14B, 32B, 72B |
| qwen25_*_{pretrain,sft,peft}_config | All | 1–8 | 1–4 | 500M, 1.5B, 3B, 7B, 14B, 32B, 72B |

Qwen3 (Dense)

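| Recipe | Mode | TP | PP | CP | Sizes |
| --- | --- | --- | --- | --- | --- |
| qwen3_*_pretrain_config | Pretrain | 1–8 | 1–2 |  | 600M–32B |
| qwen3_*_sft_config | SFT | 1–8 | 1–2 |  | 600M–32B |
| qwen3_600m_sft_128k_config | SFT | 1 | 1 | 8 | 600M (128K seq) |
| qwen3_*_peft_config | PEFT | 1 | 1 |  | 600M–32B |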

Qwen3 MoE

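| Recipe | Mode | TP | PP | EP | CP | GPUs |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3_30b_a3b_pretrain_config | Pretrain | 1 | 1 | 8 |  | 8 |
| qwen3_30b_a3b_sft_config | SFT | 1 | 1 | 8 |  | 8 |
| qwen3_30b_a3b_peft_config | PEFT | 1 | 1 | 1 |  | 1 |
| qwen3_235b_a22b_pretrain_config | Pretrain | 4 | 16 | 8 | 2 | 512+ |
| qwen3_235b_a22b_sft_config | SFT | 4 | 8 | 8 |  | 256 |
| qwen3_235b_a22b_peft_config | PEFT | 1 | 4 | 4 |  | 16 |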

Qwen3-Next

| Recipe | Mode | TP | PP | EP |
| --- | --- | --- | --- | --- |
| qwen3_next_80b_a3b_pretrain_config | Pretrain | 1 | 4 | 8 |
| qwen3_next_80b_a3b_sft_config | SFT | 1 | 2 | 8 |
| qwen3_next_80b_a3b_peft_config | PEFT | 1 | 1 | 4 |

DeepSeek

| Recipe | Mode | TP | PP | EP | GPUs |
| --- | --- | --- | --- | --- | --- |
| deepseek_v2_lite_pretrain_config | Pretrain | 1 | 1 | 8 | 8 |
| deepseek_v2_pretrain_config | Pretrain | 1 | 4 | 32 | 128 |
| deepseek_v3_pretrain_config | Pretrain | 2 | 16 | 64 | 2048 |
| deepseek_v3_pretrain_config_32nodes | Pretrain | 2 | 8 | 32 | 256 |

GLM-4.5

| Recipe | Mode | TP | PP | EP | GPUs |
| --- | --- | --- | --- | --- | --- |
| glm45_355b_pretrain_config | Pretrain | 2 | 8 | 16 | 256 |
| glm45_air_106b_pretrain_config | Pretrain | 1 | 4 | 8 | 32 |
| glm45_355b_sft_config | SFT | 2 | 8 | 16 | 256 |
| glm45_air_106b_sft_config | SFT | 1 | 4 | 8 | 32 |
| glm45_355b_peft_config | PEFT | 2 | 4 | 4 | 32 |
| glm45_air_106b_peft_config | PEFT | 1 | 2 | 4 | 8 |

Gemma

| Recipe | Mode | TP | PP | Sizes |
| --- | --- | --- | --- | --- |
| gemma2_*_{pretrain,sft,peft}_config | All | 2–8 | 1–2 | 2B, 9B, 27B |
| gemma3_1b_{pretrain,sft,peft}_config | All | 1 | 1 | 1B (32K seq) |

NemotronH / Nemotron

| Recipe | Mode | TP | PP | EP | Notes |
| --- | --- | --- | --- | --- | --- |
| nemotronh_{4b,8b,47b,56b}_*_config | P/S/PEFT | 1–8 | 1–4 |  | Dense SSM-hybrid |
| nemotron_3_nano_*_config | P/S/PEFT | varies | 1 | 8 | MoE + Mamba |
| nemotron_3_super_*_config | P/S/PEFT | 4 | 1 | 8 | MoE + Mamba, ~40% CUDA graph gain |
| nemotron_nano_{9b,12b}_v2_*_config | P/S/PEFT | varies | 1 |  | Dense |

Other Models

| Recipe | Mode | Notes |
| --- | --- | --- |
| moonlight_16b_{pretrain,sft,peft}_config | All | MoE EP=8 |
| olmoe_7b_{pretrain,sft,peft}_config | All | MoE EP=8 |
| ministral3_{3b,8b,14b}_{sft,peft}_config | SFT/PEFT | Dense |
| gpt_oss_20b_*_config | All | MoE + FP8/MXFP8 variants |
| gpt_oss_120b_*_config | All | MoE |
| vanilla_gpt_pretrain_config | Pretrain | MLM/Bridge parity baseline |
| gpt3_175b_pretrain_config | Pretrain | TP=4, PP=8, VP=6 |
| kimi_k2_pretrain_config | Pretrain | 1T MoE, TP=2 PP=16 EP=32 |

VLM Recipes

| Recipe | Mode | TP | PP | EP | GPUs |
| --- | --- | --- | --- | --- | --- |
| gemma3_vl_{4b,12b,27b}_{sft,peft}_config | SFT/PEFT | 1–8 | 1–2 |  | 1–16 |
| qwen25_vl_{3b,7b,32b,72b}_{sft,peft}_config | SFT/PEFT | 1–8 | 1–4 |  | 1–32 |
| qwen3_vl_{8b,30b_a3b,235b_a22b}_{sft,peft}_config | SFT/PEFT | 1–4 | 1–8 | 1–32 | 1–512 |
| qwen35_vl_*_{sft,peft}_config | SFT/PEFT | varies | varies | varies | varies |
| glm_45v_{sft,peft}_config | SFT/PEFT | 1 | 8 | 4–16 | 64–512 |
| nemotron_nano_v2_vl_12b_{sft,peft}_config | SFT/PEFT | 2–4 | 1 |  | 8 |

Diffusion Recipes

| Recipe | Mode | TP | CP |
| --- | --- | --- | --- |
| wan_1_3B_{pretrain,sft}_config | P/SFT | 1 | 8 |
| wan_14B_{pretrain,sft}_config | P/SFT | 2 | 4 |
| flux_12b_{pretrain,sft}_config | P/SFT | 2 | 1 |

Performance Recipe Index

All perf recipes live under scripts/performance/. They are invoked via run_script.py and use WorkloadBaseConfig presets per GPU type.

Important: Perf recipes are designed for upper-bound throughput benchmarks, not production training. They run 50 iterations on mock data by default. Throughput numbers are aspirational targets, not validated convergence configs.

Llama 3 / 3.1

| Model | GPUs | GPU Types | Key Features |
| --- | --- | --- | --- |
| Llama 3 8B | 8 | H100, B200, B300, GB200, GB300, R100 | CUDA graphs (local), FSDP on GB variants |
| Llama 3 70B | 64 | H100, B200, B300, GB200, GB300 | TP comm overlap (userbuffers), FSDP, CUDA graphs |
| Llama 3.1 405B | 128–1024 | H100, B200, B300, GB200, GB300 | TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP |

SFT/LoRA variants also exist (e.g. 8B SFT with packed sequences, 70B SFT on 32 GPUs).

DeepSeek V3

| Model | GPUs | GPU Types | Key Features |
| --- | --- | --- | --- |
| DeepSeek V3 (671B MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped) |

Qwen3 MoE

| Model | GPUs | GPU Types | Key Features |
| --- | --- | --- | --- |
| Qwen3 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | MoE alltoall/flex dispatcher |
| Qwen3 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | TP comm overlap, CUDA graphs, MoE a2a overlap |
| Qwen3-Next 80B-A3B | 64–128 | H100, B200, B300, GB200, GB300 | EP 64–128 |

Qwen3-VL

| Model | GPUs | GPU Types | Key Features |
| --- | --- | --- | --- |
| Qwen3-VL 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | VLM + MoE |
| Qwen3-VL 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | VLM + MoE, TP comm overlap |

Kimi K2

| Model | GPUs | GPU Types | Key Features |
| --- | --- | --- | --- |
| Kimi K2 (1T MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | Muon/Adam optimizer, HybridEP, pipeline layout helpers |

NemotronH

| Model | GPUs | GPU Types | Key Features |
| --- | --- | --- | --- |
| Nemotron 3 Nano (30B MoE+Mamba) | 8–16 | H100, B200, B300, GB200, GB300 | TE CUDA graphs (attn+mamba+moe), HybridEP |
| Nemotron 3 Super | 64 | H100, B200, B300, GB200, GB300 | TE CUDA graphs, EP=64 |
| NemotronH 56B | 64 | H100, B200, B300 | TP=2–8, TE graphs (mamba+attn) |

GPT-OSS

| Model | GPUs | GPU Types | Key Features |
| --- | --- | --- | --- |
| GPT-OSS 120B | 64 | H100, B200, GB200 | EP=64, HybridEP on GB200 |

Recommendation Decision Tree

User wants to train a model
├─ Know the model name?
│   ├─ Yes → Look up in Library Recipe Index above
│   │   ├─ Has a recipe for their size + mode? → Use it directly
│   │   └─ No exact match? → Use closest size, adjust parallelism
│   └─ No → Ask for model name, size, and HF model ID
├─ What's the training goal?
│   ├─ Pretrain → Use *_pretrain_config
│   ├─ SFT (full fine-tune) → Use *_sft_config
│   └─ PEFT (LoRA/DoRA) → Use *_peft_config (lowest GPU requirement)
├─ How many GPUs?
│   ├─ 1 GPU → Only PEFT recipes work (TP=1, PP=1)
│   ├─ 8 GPUs (1 node) → Most 8B–16B models, small MoE (EP=8)
│   ├─ 16–64 GPUs → 70B dense, medium MoE
│   └─ 128+ GPUs → 405B+, large MoE (DeepSeek V3, Kimi K2)
├─ Want throughput benchmarks?
│   ├─ Yes → Use perf recipes (scripts/performance/)
│   │   └─ ⚠️ These run on mock data for upper-bound perf only
│   └─ No → Use library recipes (scripts/training/run_recipe.py)
└─ Long context?
    ├─ > 8K → Need CP (context parallelism), check *_16k / *_64k / *_128k variants
    └─ ≤ 8K → Default recipes work
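
Two branches of the tree, the training goal and the long-context check, reduce to simple lookups. The sketch below uses hypothetical helper names; picking the smallest variant that covers the requested length is an assumption, since the tree only says to check the *_16k / *_64k / *_128k variants.

```python
# Recipe-name suffix for each training goal, as listed in the tree.
SUFFIXES = {
    "pretrain": "_pretrain_config",
    "sft": "_sft_config",
    "peft": "_peft_config",
}

# Long-context variants named in the tree, as (max tokens, tag).
VARIANTS = [(16 * 1024, "16k"), (64 * 1024, "64k"), (128 * 1024, "128k")]


def recipe_suffix(goal):
    """Map a training goal to the recipe-name suffix."""
    return SUFFIXES[goal.lower()]


def long_context_variant(seq_len):
    """Return None at or below 8K, else the smallest variant tag that fits."""
    if seq_len <= 8 * 1024:
        return None  # default recipes work
    for max_len, tag in VARIANTS:
        if seq_len <= max_len:
            return tag
    raise ValueError(f"no shipped variant covers seq_len={seq_len}")


print(recipe_suffix("SFT"), long_context_variant(32768))
```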

Adjustment Advice (When Recommending)

Parallelism Resizing Rules

When the user's GPU count differs from the recipe default:

  1. TP must divide num_key_value_heads (GQA constraint). E.g. if num_key_value_heads=8, valid TP = {1, 2, 4, 8}.
  2. TP should stay within a single node (NVLink). TP > 8 requires inter-node NVLink (e.g., GB200 NVL72).
  3. PP adds pipeline bubbles. Minimize PP; only increase when TP alone can't fit the model. Use VP (virtual pipeline) to mitigate bubble overhead.
  4. EP doesn't reduce dense-layer memory. Only expert parameters shard with EP. Shared attention/embeddings are replicated. For "OOM with MoE", increase EP first, not TP.
  5. SP should be True whenever TP > 1. It eliminates redundant activation copies and is essentially free.
  6. CP requires all-to-all or ring attention. Check cp_comm_type. For GQA models, a2a+p2p hierarchical CP allows CP > num_kv_heads.
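  7. world_size = DP × TP × PP × CP. DP is implicit: it is whatever remains after the explicit parallelisms, so TP × PP × CP must divide your total GPU count. In Megatron-Core, EP is not an additional factor in this product; expert parallelism is carved out of the existing ranks, so TP × PP × EP must also divide the GPU count (assuming expert TP equals TP, the default).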
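
Rules 1, 2, and 7 can be checked mechanically before launching. This is a minimal sketch for a dense model (EP = 1, so the expert term drops out); `check_dense_parallelism` and the 8-GPU node size are assumptions for illustration, and rule 6's hierarchical-CP exception is not modeled.

```python
def check_dense_parallelism(num_gpus, tp, pp, cp, num_kv_heads, gpus_per_node=8):
    """Return the implied DP size, or raise ValueError naming the violated rule."""
    # Rule 1: TP must divide the number of KV heads (GQA constraint).
    if num_kv_heads % tp != 0:
        raise ValueError(f"TP={tp} does not divide num_kv_heads={num_kv_heads}")
    # Rule 2: keep TP inside one NVLink domain (assumed 8 GPUs per node).
    if tp > gpus_per_node:
        raise ValueError(f"TP={tp} exceeds gpus_per_node={gpus_per_node}")
    # Rule 7: the explicit parallelisms must divide the GPU count.
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError(f"TP*PP*CP={model_parallel} does not divide {num_gpus} GPUs")
    return num_gpus // model_parallel  # DP is what remains


# Llama 3 8B defaults (TP=2, PP=1) on one 8-GPU node, 8 KV heads.
print(check_dense_parallelism(num_gpus=8, tp=2, pp=1, cp=1, num_kv_heads=8))
```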

Batch Size Tuning

  • Start with the recipe's micro_batch_size. If OOM, reduce to 1.
  • global_batch_size determines learning dynamics. Scale with DP: GBS = micro_batch_size × DP × gradient_accumulation_steps.
  • For MoE, micro_batch_size=1 is typical at scale.
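
The GBS relation above can be inverted to find how many gradient-accumulation steps a given setup implies. A small sketch, with `grad_accum_steps` as a hypothetical helper name:

```python
def grad_accum_steps(global_batch_size, micro_batch_size, dp):
    """Solve GBS = micro_batch_size * DP * steps for steps."""
    samples_per_step = micro_batch_size * dp
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"GBS={global_batch_size} is not a multiple of "
            f"micro_batch_size*DP={samples_per_step}"
        )
    return global_batch_size // samples_per_step


# GBS=64 with micro_batch_size=1 on DP=4 needs 16 accumulation steps;
# doubling DP to 8 at the same GBS halves that to 8.
print(grad_accum_steps(64, 1, 4), grad_accum_steps(64, 1, 8))
```

Keeping GBS fixed while DP changes preserves the learning dynamics; only the number of accumulation steps moves.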

Common Pitfalls to Warn About

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| TP > num_kv_heads | Crash: "TP must divide num_query_groups" | Reduce TP to a divisor of num_kv_heads |
| PP without VP | Poor throughput (large bubble) | Set virtual_pipeline_model_parallel_size |
| EP too low for large MoE | OOM on expert params | Increase EP; each EP rank holds num_experts / EP experts |
| CUDA graphs + packed sequences | Assert: "CUDA graph accepts only Tensor inputs" | Disable packing or use local full-iteration graphs |
| CUDA graphs + full recompute | Assert: "full recompute only with full iteration CUDA graph" | Disable recompute or switch to local impl |
| use_te_rng_tracker not set | Assert on provider init when CUDA graphs enabled | Set cfg.model.use_te_rng_tracker = True and cfg.rng.te_rng_tracker = True |
| FSDP + TP > 1 on H100 | Possible comm bottleneck | Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series |
| Long context without CP | OOM on activations | Add CP=2/4/8; use *_16k, *_64k, or *_128k recipe variants |
| MoE overlap_grad_reduce on H100 | May hurt perf (False in many H100 presets) | Set overlap_grad_reduce=False for MoE on H100 |
| VLM SFT missing image data | Runs but produces garbage | Provide actual multimodal dataset or use mock VLM data |
| Qwen35-VL MoE FSDP | Tested on Blackwell only | May not work on H100; validate first |
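
Several of these pitfalls can be caught from the override values alone, before any GPU time is spent. The sketch below is illustrative only: `preflight_warnings` is a hypothetical helper, the `model.pipeline_model_parallel_size` and `model.virtual_pipeline_model_parallel_size` keys are assumed by analogy with the documented `model.*` overrides, and treating an unset `cuda_graph_impl` as "none" is also an assumption.

```python
def preflight_warnings(overrides):
    """Return warning strings for pitfalls detectable from override values.

    `overrides` maps dotted keys to already-typed Python values.
    """
    found = []

    # CUDA graphs need both TE RNG tracker flags (see table above).
    graphs_on = overrides.get("model.cuda_graph_impl", "none") != "none"
    rng_ok = bool(overrides.get("model.use_te_rng_tracker")) and bool(
        overrides.get("rng.te_rng_tracker")
    )
    if graphs_on and not rng_ok:
        found.append("CUDA graphs enabled without both TE RNG tracker flags")

    # Sequences above 8K need context parallelism.
    seq_len = overrides.get("model.seq_length", 0)
    cp = overrides.get("model.context_parallel_size", 1)
    if seq_len > 8192 and cp == 1:
        found.append("seq_length > 8K without CP")

    # PP > 1 without a virtual pipeline leaves a large bubble (assumed key names).
    pp = overrides.get("model.pipeline_model_parallel_size", 1)
    vp = overrides.get("model.virtual_pipeline_model_parallel_size")
    if pp > 1 and vp is None:
        found.append("PP > 1 without virtual pipeline")

    return found


print(preflight_warnings({"model.cuda_graph_impl": "transformer_engine"}))
```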

Recipe Override Examples

# Scale Llama3 8B from 2 GPUs to 8 GPUs (no override needed: TP=2 stays fixed, so DP grows from 1 to 4)
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock

# Reduce parallelism for Qwen3-MoE 30B to fit on 4 GPUs
uv run torchrun --nproc_per_node=4 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_sft_config \
    --dataset llm-finetune \
    'model.expert_model_parallel_size=4'

# Add long context to an existing recipe
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.seq_length=32768' \
    'model.context_parallel_size=4'

# Enable CUDA graphs on any recipe
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.cuda_graph_impl=transformer_engine' \
    'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
    'model.use_te_rng_tracker=True' \
    'rng.te_rng_tracker=True'
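
Each override above is a dotted path into the config followed by a value. The sketch below shows that mapping on a plain nested dict; it is not the parser run_recipe.py uses (the source does not show it), `apply_override` is a hypothetical name, and values are left as strings because type coercion is out of scope here.

```python
def apply_override(config, override):
    """Write one 'dotted.path=value' override into a nested dict."""
    path, _, raw_value = override.partition("=")
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        # Walk down, creating intermediate sections as needed.
        node = node.setdefault(key, {})
    node[leaf] = raw_value  # kept as a string; no type coercion
    return config


cfg = {}
for item in ["model.seq_length=32768", "model.context_parallel_size=4"]:
    apply_override(cfg, item)
print(cfg)
```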

Quick Reference: Which Recipe for My Situation?

| I want to... | Start with | GPUs needed |
| --- | --- | --- |
| Try Bridge for the first time | llama3_8b_sft_config + mock data | 2 |
| Fine-tune a 7-8B model | llama3_8b_sft_config or qwen3_8b_sft_config | 2–8 |
| LoRA on 1 GPU | llama3_8b_peft_config or qwen3_8b_peft_config | 1 |
| Pretrain a dense 70B | llama3_70b_pretrain_config | 32–64 |
| Train a small MoE | qwen3_30b_a3b_pretrain_config | 8 |
| Train a large MoE (235B+) | qwen3_235b_a22b_pretrain_config | 256–512 |
| Benchmark throughput | Perf recipes via run_script.py | Varies |
| Long-context training | llama3_8b_128k_pretrain_config or add CP override | 16+ |
| VLM fine-tuning | qwen3_vl_8b_sft_config or gemma3_vl_*_sft_config | 4–8 |
| Diffusion training | wan_1_3B_pretrain_config or flux_12b_pretrain_config | 8 |

Code Anchors

| What | Path |
| --- | --- |
| Library recipes root | src/megatron/bridge/recipes/ |
| Recipe __init__.py (all exports) | src/megatron/bridge/recipes/__init__.py |
| Common recipe helpers | src/megatron/bridge/recipes/common.py |
| Training entry point | scripts/training/run_recipe.py |
| Perf recipes root | scripts/performance/ |
| Perf entry point | scripts/performance/run_script.py |
| Perf workload configs | scripts/performance/configs/<family>/ |
| Perf overrides (benchmark defaults) | scripts/performance/utils/overrides.py |