# Auto Recipe — Recipe Index & Recommendation
This skill indexes every shipped recipe and helps users pick the right starting
config, adjust parallelism, and avoid common pitfalls.
## How to Use This Skill
- Ask the user for: model name/size, GPU count & type, training goal
(pretrain / SFT / PEFT), and sequence length (if non-default).
- Look up the best-match recipe in the index below.
- Recommend the recipe function name + entry-point command.
- Provide adjustment advice (parallelism resizing, batch tuning, pitfalls).
## Entry Points

### Library recipes (functional training)

```bash
# Pretrain with mock data (8 GPUs on one node)
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-pretrain-mock

# SFT / fine-tuning dataset
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-finetune

# Dotted-path overrides are appended as extra arguments
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.tensor_model_parallel_size=2' \
    'training.global_batch_size=64'
```
### Performance recipes (throughput benchmarks)

```bash
# Throughput benchmark on 64 H100s with mock data
python scripts/performance/run_script.py \
    --recipe <model_family> \
    --gpu_type h100 \
    --num_gpus 64 \
    --data mock
```
**Note:** Perf recipes are NOT fully validated for correctness; most testing has
been done on mock data. They are designed for upper-bound throughput measurement,
not production training. Always validate loss curves and convergence independently.
## Recipe Unification (Coming Soon — PR #2803)

PR #2803 is unifying performance recipes into the same Python function format
used by library recipes. Key changes:

- Perf recipes move from `scripts/performance/configs/` →
  `src/megatron/bridge/recipes/<family>/<model>_perf.py`
- Each perf recipe becomes a self-contained Python function (e.g.
  `llama3_8b_h100_bf16_pretrain_config()`)
- The old `WorkloadBaseConfig → set_workload_base_configs → get_perf_optimized_recipe`
  pipeline is removed
- Shared helpers: `_benchmark_common()` (50 iters, timing, TE RNG) and
  `_perf_precision()` (bf16 / fp8_cs / fp8_mx / nvfp4)
Why Python, not YAML? Previous YAML-based approaches had problems:
recipe logic was split across multiple indirection layers, configs were not
self-contained, and the two-level pipeline made maintenance and debugging
difficult. Python functions are explicit, greppable, and composable.
After #2803 lands, both library and perf recipes will be invocable through the
same run_recipe.py entry point.
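For orientation, here is a rough sketch of what a unified perf recipe might look
like after #2803. Only the function and helper names above come from the PR
summary; the arguments, config fields, and body below are illustrative
assumptions, not the actual implementation.

```python
# Hypothetical sketch only -- not the real Bridge code. It illustrates why a
# plain Python function is easier to grep and compose than the YAML +
# indirection pipeline: everything the benchmark does is visible in one place.
def llama3_8b_h100_bf16_pretrain_config():
    # Start from the corresponding library recipe (assumed import).
    cfg = llama3_8b_pretrain_config()

    # Shared benchmark setup: ~50 iterations, timing, TE RNG tracker.
    _benchmark_common(cfg)

    # Precision preset: "bf16", "fp8_cs", "fp8_mx", or "nvfp4".
    _perf_precision(cfg, "bf16")

    # Model/hardware-specific tuning lives right here, greppable in one file.
    cfg.model.tensor_model_parallel_size = 1
    cfg.training.micro_batch_size = 1
    return cfg
```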
## Library Recipe Index

All recipes live under `src/megatron/bridge/recipes/`. Each function returns a
`ConfigContainer` with model, training, optimizer, and data settings.
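Recipes can also be consumed directly from Python instead of through
`run_recipe.py`. A minimal sketch, assuming the import path follows the directory
layout above and the attribute names mirror the CLI overrides shown earlier
(both are assumptions, not verified API):

```python
# Sketch: build a config from a recipe function, then override fields in Python.
# Import path and attribute names are assumed from this document's examples.
from megatron.bridge.recipes.llama import llama3_8b_pretrain_config

cfg = llama3_8b_pretrain_config()          # returns a ConfigContainer
cfg.model.tensor_model_parallel_size = 2   # same knob as 'model.tensor_model_parallel_size=2'
cfg.training.global_batch_size = 64        # same knob as 'training.global_batch_size=64'
```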
### Llama

| Recipe | Mode | TP | PP | CP | SP | GPUs (min) | Seq Len |
|---|---|---|---|---|---|---|---|
| `llama2_7b_pretrain_config` | Pretrain | 2 | 1 | — | — | 2 | 4K |
| `llama3_8b_pretrain_config` | Pretrain | 2 | 1 | — | ✓ | 2 | 8K |
| `llama3_8b_16k_pretrain_config` | Pretrain | 2 | 1 | 2 | ✓ | 4 | 16K |
| `llama3_8b_64k_pretrain_config` | Pretrain | 2 | 1 | 4 | ✓ | 8 | 64K |
| `llama3_8b_128k_pretrain_config` | Pretrain | 2 | 1 | 8 | ✓ | 16 | 128K |
| `llama3_70b_pretrain_config` | Pretrain | 8 | 4 | — | ✓ | 32 | 8K |
| `llama3_70b_16k_pretrain_config` | Pretrain | 8 | 4 | 2 | ✓ | 64 | 16K |
| `llama3_70b_64k_pretrain_config` | Pretrain | 8 | 4 | 4 | ✓ | 128 | 64K |
| `llama31_405b_pretrain_config` | Pretrain | 8 | 16 | — | ✓ | 128 | 8K |
| `llama3_8b_sft_config` | SFT | 2 | 1 | — | ✓ | 2 | 8K |
| `llama3_70b_sft_config` | SFT | 4 | 4 | — | ✓ | 16 | 8K |
| `llama31_405b_sft_config` | SFT | 8 | 8 | — | ✓ | 64 | 8K |
| `llama3_8b_peft_config` | PEFT | 1 | 1 | — | — | 1 | 8K |
| `llama3_70b_peft_config` | PEFT | 2 | 4 | — | ✓ | 8 | 8K |
| `llama31_405b_peft_config` | PEFT | 4 | 8 | — | ✓ | 32 | 8K |
### Qwen2 / Qwen2.5

| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| `qwen2_*_{pretrain,sft,peft}_config` | All | 1–8 | 1–4 | 500M, 1.5B, 7B, 14B, 32B, 72B |
| `qwen25_*_{pretrain,sft,peft}_config` | All | 1–8 | 1–4 | 500M, 1.5B, 3B, 7B, 14B, 32B, 72B |
### Qwen3 (Dense)

| Recipe | Mode | TP | PP | CP | Sizes |
|---|---|---|---|---|---|
| `qwen3_*_pretrain_config` | Pretrain | 1–8 | 1–2 | — | 600M–32B |
| `qwen3_*_sft_config` | SFT | 1–8 | 1–2 | — | 600M–32B |
| `qwen3_600m_sft_128k_config` | SFT | 1 | 1 | 8 | 600M (128K seq) |
| `qwen3_*_peft_config` | PEFT | 1 | 1 | — | 600M–32B |
### Qwen3 MoE

| Recipe | Mode | TP | PP | EP | CP | GPUs |
|---|---|---|---|---|---|---|
| `qwen3_30b_a3b_pretrain_config` | Pretrain | 1 | 1 | 8 | — | 8 |
| `qwen3_30b_a3b_sft_config` | SFT | 1 | 1 | 8 | — | 8 |
| `qwen3_30b_a3b_peft_config` | PEFT | 1 | 1 | 1 | — | 1 |
| `qwen3_235b_a22b_pretrain_config` | Pretrain | 4 | 16 | 8 | 2 | 512+ |
| `qwen3_235b_a22b_sft_config` | SFT | 4 | 8 | 8 | — | 256 |
| `qwen3_235b_a22b_peft_config` | PEFT | 1 | 4 | 4 | — | 16 |
### Qwen3-Next

| Recipe | Mode | TP | PP | EP |
|---|---|---|---|---|
| `qwen3_next_80b_a3b_pretrain_config` | Pretrain | 1 | 4 | 8 |
| `qwen3_next_80b_a3b_sft_config` | SFT | 1 | 2 | 8 |
| `qwen3_next_80b_a3b_peft_config` | PEFT | 1 | 1 | 4 |
### DeepSeek

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `deepseek_v2_lite_pretrain_config` | Pretrain | 1 | 1 | 8 | 8 |
| `deepseek_v2_pretrain_config` | Pretrain | 1 | 4 | 32 | 128 |
| `deepseek_v3_pretrain_config` | Pretrain | 2 | 16 | 64 | 2048 |
| `deepseek_v3_pretrain_config_32nodes` | Pretrain | 2 | 8 | 32 | 256 |
### GLM-4.5

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `glm45_355b_pretrain_config` | Pretrain | 2 | 8 | 16 | 256 |
| `glm45_air_106b_pretrain_config` | Pretrain | 1 | 4 | 8 | 32 |
| `glm45_355b_sft_config` | SFT | 2 | 8 | 16 | 256 |
| `glm45_air_106b_sft_config` | SFT | 1 | 4 | 8 | 32 |
| `glm45_355b_peft_config` | PEFT | 2 | 4 | 4 | 32 |
| `glm45_air_106b_peft_config` | PEFT | 1 | 2 | 4 | 8 |
### Gemma

| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| `gemma2_*_{pretrain,sft,peft}_config` | All | 2–8 | 1–2 | 2B, 9B, 27B |
| `gemma3_1b_{pretrain,sft,peft}_config` | All | 1 | 1 | 1B (32K seq) |
### NemotronH / Nemotron

| Recipe | Mode | TP | PP | EP | Notes |
|---|---|---|---|---|---|
| `nemotronh_{4b,8b,47b,56b}_*_config` | P/S/PEFT | 1–8 | 1–4 | — | Dense SSM-hybrid |
| `nemotron_3_nano_*_config` | P/S/PEFT | varies | 1 | 8 | MoE + Mamba |
| `nemotron_3_super_*_config` | P/S/PEFT | 4 | 1 | 8 | MoE + Mamba, ~40% CUDA graph gain |
| `nemotron_nano_{9b,12b}_v2_*_config` | P/S/PEFT | varies | 1 | — | Dense |
### Other Models

| Recipe | Mode | Notes |
|---|---|---|
| `moonlight_16b_{pretrain,sft,peft}_config` | All | MoE, EP=8 |
| `olmoe_7b_{pretrain,sft,peft}_config` | All | MoE, EP=8 |
| `ministral3_{3b,8b,14b}_{sft,peft}_config` | SFT/PEFT | Dense |
| `gpt_oss_20b_*_config` | All | MoE + FP8/MXFP8 variants |
| `gpt_oss_120b_*_config` | All | MoE |
| `vanilla_gpt_pretrain_config` | Pretrain | MLM/Bridge parity baseline |
| `gpt3_175b_pretrain_config` | Pretrain | TP=4, PP=8, VP=6 |
| `kimi_k2_pretrain_config` | Pretrain | 1T MoE, TP=2 PP=16 EP=32 |
### VLM Recipes

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `gemma3_vl_{4b,12b,27b}_{sft,peft}_config` | SFT/PEFT | 1–8 | 1–2 | — | 1–16 |
| `qwen25_vl_{3b,7b,32b,72b}_{sft,peft}_config` | SFT/PEFT | 1–8 | 1–4 | — | 1–32 |
| `qwen3_vl_{8b,30b_a3b,235b_a22b}_{sft,peft}_config` | SFT/PEFT | 1–4 | 1–8 | 1–32 | 1–512 |
| `qwen35_vl_*_{sft,peft}_config` | SFT/PEFT | varies | varies | varies | varies |
| `glm_45v_{sft,peft}_config` | SFT/PEFT | 1 | 8 | 4–16 | 64–512 |
| `nemotron_nano_v2_vl_12b_{sft,peft}_config` | SFT/PEFT | 2–4 | 1 | — | 8 |
### Diffusion Recipes

| Recipe | Mode | TP | CP |
|---|---|---|---|
| `wan_1_3B_{pretrain,sft}_config` | P/SFT | 1 | 8 |
| `wan_14B_{pretrain,sft}_config` | P/SFT | 2 | 4 |
| `flux_12b_{pretrain,sft}_config` | P/SFT | 2 | 1 |
## Performance Recipe Index

All perf recipes live under `scripts/performance/`. They are invoked via
`run_script.py` and use `WorkloadBaseConfig` presets per GPU type.

**Important:** Perf recipes are designed for upper-bound throughput benchmarks,
not production training. They run 50 iterations on mock data by default.
Throughput numbers are aspirational targets, not validated convergence configs.
### Llama 3 / 3.1

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Llama 3 8B | 8 | H100, B200, B300, GB200, GB300, R100 | CUDA graphs (local), FSDP on GB variants |
| Llama 3 70B | 64 | H100, B200, B300, GB200, GB300 | TP comm overlap (userbuffers), FSDP, CUDA graphs |
| Llama 3.1 405B | 128–1024 | H100, B200, B300, GB200, GB300 | TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP |

SFT/LoRA variants also exist (e.g. 8B SFT with packed sequences, 70B SFT on 32 GPUs).
### DeepSeek V3

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| DeepSeek V3 (671B MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped) |
### Qwen3 MoE

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | MoE alltoall/flex dispatcher |
| Qwen3 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | TP comm overlap, CUDA graphs, MoE a2a overlap |
| Qwen3-Next 80B-A3B | 64–128 | H100, B200, B300, GB200, GB300 | EP 64–128 |
### Qwen3-VL

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3-VL 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | VLM + MoE |
| Qwen3-VL 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | VLM + MoE, TP comm overlap |
### Kimi K2

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Kimi K2 (1T MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | Muon/Adam optimizer, HybridEP, pipeline layout helpers |
### NemotronH

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Nemotron 3 Nano (30B MoE+Mamba) | 8–16 | H100, B200, B300, GB200, GB300 | TE CUDA graphs (attn+mamba+moe), HybridEP |
| Nemotron 3 Super | 64 | H100, B200, B300, GB200, GB300 | TE CUDA graphs, EP=64 |
| NemotronH 56B | 64 | H100, B200, B300 | TP=2–8, TE graphs (mamba+attn) |
### GPT-OSS

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| GPT-OSS 120B | 64 | H100, B200, GB200 | EP=64, HybridEP on GB200 |
## Recommendation Decision Tree

```text
User wants to train a model
│
├─ Know the model name?
│   ├─ Yes → Look up in Library Recipe Index above
│   │   ├─ Has a recipe for their size + mode? → Use it directly
│   │   └─ No exact match? → Use closest size, adjust parallelism
│   └─ No → Ask for model name, size, and HF model ID
│
├─ What's the training goal?
│   ├─ Pretrain → Use *_pretrain_config
│   ├─ SFT (full fine-tune) → Use *_sft_config
│   └─ PEFT (LoRA/DoRA) → Use *_peft_config (lowest GPU requirement)
│
├─ How many GPUs?
│   ├─ 1 GPU → Only PEFT recipes work (TP=1, PP=1)
│   ├─ 8 GPUs (1 node) → Most 8B–16B models, small MoE (EP=8)
│   ├─ 16–64 GPUs → 70B dense, medium MoE
│   └─ 128+ GPUs → 405B+, large MoE (DeepSeek V3, Kimi K2)
│
├─ Want throughput benchmarks?
│   ├─ Yes → Use perf recipes (scripts/performance/)
│   │   └─ ⚠️ These run on mock data for upper-bound perf only
│   └─ No → Use library recipes (scripts/training/run_recipe.py)
│
└─ Long context?
    ├─ > 8K → Need CP (context parallelism), check *_16k / *_64k / *_128k variants
    └─ ≤ 8K → Default recipes work
```
## Adjustment Advice (When Recommending)

### Parallelism Resizing Rules

When the user's GPU count differs from the recipe default:

- TP must divide `num_key_value_heads` (GQA constraint). E.g. if
  `num_key_value_heads=8`, valid TP = {1, 2, 4, 8}.
- TP should stay within a single node (NVLink). TP > 8 requires inter-node
  NVLink (e.g., GB200 NVL72).
- PP adds pipeline bubbles. Minimize PP; only increase it when TP alone can't
  fit the model. Use VP (virtual pipeline) to mitigate bubble overhead.
- EP doesn't reduce dense-layer memory. Only expert parameters shard with EP;
  shared attention/embeddings are replicated. For "OOM with MoE", increase EP
  first, not TP.
- SP should be True whenever TP > 1. It eliminates redundant activation copies
  and is essentially free.
- CP requires all-to-all or ring attention. Check `cp_comm_type`. For GQA
  models, a2a+p2p hierarchical CP allows CP > num_kv_heads.
- world_size = DP × TP × PP × CP × EP. DP is implicit. Make sure the product of
  explicit parallelisms divides your total GPU count, as in the sketch below.
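A quick self-contained check for these rules (plain Python, no Bridge
dependency; the GQA and divisibility constraints follow the bullets above):

```python
def check_parallelism(world_size, tp, pp, cp=1, ep=1, num_kv_heads=8):
    """Validate a parallelism layout against the rules above and return DP."""
    if num_kv_heads % tp != 0:
        raise ValueError(f"TP={tp} must divide num_key_value_heads={num_kv_heads}")
    model_parallel = tp * pp * cp * ep
    if world_size % model_parallel != 0:
        raise ValueError(
            f"TP*PP*CP*EP={model_parallel} must divide world_size={world_size}")
    return world_size // model_parallel  # DP is whatever is left over

# Example: llama3_70b_pretrain_config defaults (TP=8, PP=4) on 64 GPUs -> DP=2
print(check_parallelism(64, tp=8, pp=4))
```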
### Batch Size Tuning

- Start with the recipe's `micro_batch_size`. If OOM, reduce it to 1.
- `global_batch_size` determines learning dynamics. Scale with DP:
  GBS = micro_batch_size × DP × gradient_accumulation_steps.
- For MoE, `micro_batch_size=1` is typical at scale.
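The same arithmetic as a short sketch (plain Python; it just restates the GBS
formula above):

```python
def grad_accum_steps(global_batch_size, micro_batch_size, dp):
    """GBS = micro_batch_size * DP * grad_accum, so grad_accum is derived."""
    per_step = micro_batch_size * dp
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} must be divisible by "
            f"micro_batch_size * DP = {per_step}")
    return global_batch_size // per_step

# Example: GBS=64 with micro_batch_size=1 and DP=8 -> 8 accumulation steps
print(grad_accum_steps(64, 1, 8))
```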
### Common Pitfalls to Warn About

| Pitfall | Symptom | Fix |
|---|---|---|
| TP > num_kv_heads | Crash: "TP must divide num_query_groups" | Reduce TP to a divisor of num_kv_heads |
| PP without VP | Poor throughput (large pipeline bubble) | Set `virtual_pipeline_model_parallel_size` |
| EP too low for large MoE | OOM on expert params | Increase EP; each rank then holds num_experts/EP experts |
| CUDA graphs + packed sequences | Assert: "CUDA graph accepts only Tensor inputs" | Disable packing or use local full-iteration graphs |
| CUDA graphs + full recompute | Assert: "full recompute only with full iteration CUDA graph" | Disable recompute or switch to the local impl |
| `use_te_rng_tracker` not set | Assert on provider init when CUDA graphs enabled | Set `cfg.model.use_te_rng_tracker = True` and `cfg.rng.te_rng_tracker = True` |
| FSDP + TP > 1 on H100 | Possible comm bottleneck | Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series |
| Long context without CP | OOM on activations | Add CP=2/4/8; use the `*_16k`, `*_64k`, or `*_128k` recipe variants |
| MoE `overlap_grad_reduce` on H100 | May hurt perf (set to False in many H100 presets) | Set `overlap_grad_reduce=False` for MoE on H100 |
| VLM SFT missing image data | Runs but produces garbage | Provide an actual multimodal dataset or use mock VLM data |
| Qwen35-VL MoE FSDP | Tested on Blackwell only | May not work on H100; validate first |
## Recipe Override Examples

```bash
# Baseline: Llama 3 8B pretrain on mock data
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock

# SFT a small MoE on 4 GPUs, shrinking EP to match the GPU count
uv run torchrun --nproc_per_node=4 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_sft_config \
    --dataset llm-finetune \
    'model.expert_model_parallel_size=4'

# Longer context: raise seq_length and add context parallelism
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.seq_length=32768' \
    'model.context_parallel_size=4'

# Scoped TE CUDA graphs for a MoE (requires the TE RNG tracker; see pitfalls)
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.cuda_graph_impl=transformer_engine' \
    'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
    'model.use_te_rng_tracker=True' \
    'rng.te_rng_tracker=True'
```
## Quick Reference: Which Recipe for My Situation?

| I want to... | Start with | GPUs needed |
|---|---|---|
| Try Bridge for the first time | `llama3_8b_sft_config` + mock data | 2 |
| Fine-tune a 7–8B model | `llama3_8b_sft_config` or `qwen3_8b_sft_config` | 2–8 |
| LoRA on 1 GPU | `llama3_8b_peft_config` or `qwen3_8b_peft_config` | 1 |
| Pretrain a dense 70B | `llama3_70b_pretrain_config` | 32–64 |
| Train a small MoE | `qwen3_30b_a3b_pretrain_config` | 8 |
| Train a large MoE (235B+) | `qwen3_235b_a22b_pretrain_config` | 256–512 |
| Benchmark throughput | Perf recipes via `run_script.py` | Varies |
| Long-context training | `llama3_8b_128k_pretrain_config` or add CP override | 16+ |
| VLM fine-tuning | `qwen3_vl_8b_sft_config` or `gemma3_vl_*_sft_config` | 4–8 |
| Diffusion training | `wan_1_3B_pretrain_config` or `flux_12b_pretrain_config` | 8 |
## Code Anchors

| What | Path |
|---|---|
| Library recipes root | `src/megatron/bridge/recipes/` |
| Recipe `__init__.py` (all exports) | `src/megatron/bridge/recipes/__init__.py` |
| Common recipe helpers | `src/megatron/bridge/recipes/common.py` |
| Training entry point | `scripts/training/run_recipe.py` |
| Perf recipes root | `scripts/performance/` |
| Perf entry point | `scripts/performance/run_script.py` |
| Perf workload configs | `scripts/performance/configs/<family>/` |
| Perf overrides (benchmark defaults) | `scripts/performance/utils/overrides.py` |