# mlm-bridge-training

## MLM vs Bridge Training

For how the two frameworks differ, the argument mapping tables, gotchas, and the
translation script, see `docs/megatron-lm-to-megatron-bridge.md`.
## Correlation Testing

Use `vanilla_gpt_pretrain_config` for loss-correlation testing. This recipe uses
bare `GPTModelProvider` defaults (LayerNorm, GeLU, `learned_absolute` position
embeddings, vocab size inherited from the tokenizer), which match the defaults of
MLM's `pretrain_gpt.py` when run with no model-architecture args.
### MLM Correlation Run (2L/256H, 1 GPU)

```bash
PYTHONPATH=3rdparty/Megatron-LM:$PYTHONPATH \
uv run python -m torch.distributed.run --nproc_per_node=1 \
  3rdparty/Megatron-LM/pretrain_gpt.py \
  --num-layers 2 --hidden-size 256 --num-attention-heads 4 \
  --ffn-hidden-size 1024 --seq-length 512 --max-position-embeddings 512 \
  --micro-batch-size 4 --global-batch-size 32 \
  --train-iters 10 --eval-iters 2 --eval-interval 10 \
  --mock-data --bf16 --use-mcore-models \
  --tokenizer-type NullTokenizer --vocab-size 32000 \
  --lr 3e-4 --min-lr 3e-5 --seed 1234 --log-interval 1
```
### Bridge Correlation Run (same config, 1 GPU)

```bash
rm -rf nemo_experiments && \
uv run python -m torch.distributed.run --nproc_per_node=1 \
  scripts/training/run_recipe.py \
  --recipe vanilla_gpt_pretrain_config \
  model.num_layers=2 model.hidden_size=256 \
  model.num_attention_heads=4 model.ffn_hidden_size=1024 \
  model.seq_length=512 dataset.sequence_length=512 \
  train.train_iters=10 train.global_batch_size=32 train.micro_batch_size=4 \
  validation.eval_interval=10 validation.eval_iters=2 \
  optimizer.lr=3e-4 optimizer.min_lr=3e-5 \
  scheduler.lr_warmup_iters=1 scheduler.lr_decay_iters=10 \
  rng.seed=1234 logger.log_interval=1
```
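The overrides in the two correlation runs line up one-to-one. A minimal sketch of that mapping, derived only from the commands in this document (the full tables live in `docs/megatron-lm-to-megatron-bridge.md`; the `translate` helper is illustrative, not part of the repo):

```python
# Mapping from MLM pretrain_gpt.py flags to Bridge run_recipe.py overrides,
# as used in the correlation runs above (not exhaustive).
MLM_TO_BRIDGE = {
    "--num-layers": "model.num_layers",
    "--hidden-size": "model.hidden_size",
    "--num-attention-heads": "model.num_attention_heads",
    "--ffn-hidden-size": "model.ffn_hidden_size",
    "--seq-length": "model.seq_length",  # Bridge also needs dataset.sequence_length
    "--micro-batch-size": "train.micro_batch_size",
    "--global-batch-size": "train.global_batch_size",
    "--train-iters": "train.train_iters",
    "--eval-iters": "validation.eval_iters",
    "--eval-interval": "validation.eval_interval",
    "--lr": "optimizer.lr",
    "--min-lr": "optimizer.min_lr",
    "--seed": "rng.seed",
    "--log-interval": "logger.log_interval",
}

def translate(mlm_args: list) -> list:
    """Translate '--flag value' pairs into 'key=value' overrides.

    Assumes every flag takes a value; boolean flags like --mock-data or
    --bf16 have no direct key=value form and are skipped by pairing.
    """
    out = []
    for flag, value in zip(mlm_args[::2], mlm_args[1::2]):
        if flag in MLM_TO_BRIDGE:
            out.append(f"{MLM_TO_BRIDGE[flag]}={value}")
    return out
```

For example, `translate(["--num-layers", "2", "--lr", "3e-4"])` yields `["model.num_layers=2", "optimizer.lr=3e-4"]`.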
### Verification

With matched parameters, the LM losses should be nearly identical at each
iteration. Compare the `lm loss` values from the two logs; they should agree to
within BF16 rounding.
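The comparison can be done mechanically. A sketch assuming the Megatron-style log format, where each training line carries an `lm loss: <value>` field (adjust the regex and tolerance to your actual log output):

```python
import re

# Matches the per-iteration "lm loss" field in Megatron-style training logs.
# Assumes values like "lm loss: 1.0390E+01"; adjust if your logs differ.
LOSS_RE = re.compile(r"lm loss:\s*([0-9.Ee+-]+)")

def extract_losses(log_text: str) -> list:
    return [float(m) for m in LOSS_RE.findall(log_text)]

def losses_match(mlm_log: str, bridge_log: str, rel_tol: float = 1e-2) -> bool:
    """Check the two runs agree to within (roughly) BF16 rounding."""
    a, b = extract_losses(mlm_log), extract_losses(bridge_log)
    if len(a) != len(b) or not a:
        return False
    return all(abs(x - y) <= rel_tol * max(abs(x), abs(y)) for x, y in zip(a, b))
```

The `1e-2` relative tolerance is a rough stand-in for BF16's ~3 decimal digits of precision; tighten it if your runs track more closely.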
## Multi-GPU Examples

### MLM 2-GPU with TP=2

```bash
PYTHONPATH=3rdparty/Megatron-LM:$PYTHONPATH \
uv run python -m torch.distributed.run --nproc_per_node=2 \
  3rdparty/Megatron-LM/pretrain_gpt.py \
  --tensor-model-parallel-size 2 --sequence-parallel \
  --num-layers 4 --hidden-size 256 --num-attention-heads 4 \
  --seq-length 1024 --max-position-embeddings 1024 \
  --micro-batch-size 2 --global-batch-size 16 \
  --train-iters 10 --eval-iters 2 --eval-interval 10 \
  --mock-data --bf16 --use-mcore-models \
  --tokenizer-type NullTokenizer --vocab-size 1024 \
  --lr 1e-4 --log-interval 1
```
### Bridge 2-GPU with TP=2

```bash
rm -rf nemo_experiments && \
uv run python -m torch.distributed.run --nproc_per_node=2 \
  scripts/training/run_recipe.py \
  --recipe vanilla_gpt_pretrain_config \
  model.tensor_model_parallel_size=2 model.sequence_parallel=true \
  model.num_layers=4 model.hidden_size=256 \
  model.num_attention_heads=4 model.ffn_hidden_size=1024 \
  model.seq_length=1024 dataset.sequence_length=1024 \
  train.train_iters=10 train.global_batch_size=16 train.micro_batch_size=2 \
  validation.eval_interval=10 validation.eval_iters=2 \
  scheduler.lr_warmup_iters=2 scheduler.lr_decay_iters=10 \
  logger.log_interval=1
```
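When scaling these runs, the batch sizes must stay consistent: Megatron derives the number of gradient-accumulation microbatches from the global batch size, micro batch size, and data-parallel size, and asserts at startup if they do not divide evenly. A sketch of the arithmetic (the DP formula here assumes only tensor and pipeline parallelism, as in the runs above):

```python
def num_microbatches(global_bs: int, micro_bs: int, world_size: int,
                     tp: int = 1, pp: int = 1) -> int:
    """Gradient-accumulation steps per iteration.

    data_parallel_size = world_size // (tp * pp); global_bs must divide
    evenly by micro_bs * data_parallel_size.
    """
    dp = world_size // (tp * pp)
    assert global_bs % (micro_bs * dp) == 0, "GBS must be divisible by MBS * DP"
    return global_bs // (micro_bs * dp)

# The 2-GPU TP=2 runs above: GBS=16, MBS=2, DP = 2 // 2 = 1 -> 8 microbatches.
```

The 1-GPU correlation runs (GBS=32, MBS=4, DP=1) likewise accumulate over 8 microbatches per iteration.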
## Available Recipes

Common recipes (use with `--recipe`):

- `vanilla_gpt_pretrain_config`: minimal GPT (bare `GPTModelProvider` defaults; ideal for correlation testing and custom configs)
- `llama32_1b_pretrain_config`: Llama 3.2 1B (16L, 2048H, GBS=512, seq=8192)
- `llama3_8b_pretrain_config`: Llama 3 8B
- `qwen3_8b_pretrain_config`: Qwen3 8B
- `deepseek_v2_lite_pretrain_config`: DeepSeek-V2-Lite 16B MoE

SFT/PEFT variants use the `_sft_config` / `_peft_config` suffixes.
## Megatron-Core Submodule

For what the submodule is and why two versions exist, see
`docs/megatron-lm-to-megatron-bridge.md`.

### Check the current version

```bash
./scripts/switch_mcore.sh status
```

### Switch to dev for testing newer MCore features

```bash
./scripts/switch_mcore.sh dev
# uv sync (without --locked) since the lockfile targets main
uv sync
```

### Switch back to main

```bash
./scripts/switch_mcore.sh main
```

### After pulling latest main

When you pull the latest Bridge `main` branch, the submodule pointer may have
been updated. Re-sync the submodule:

```bash
git submodule update --init 3rdparty/Megatron-LM
```
## Pitfalls

- **Always `rm -rf nemo_experiments` before a fresh correlation run.** Bridge
  silently auto-resumes from stale checkpoints.
- **`uv run` required:** always use `uv run python -m torch.distributed.run`,
  not bare `torchrun` or `python`.
- **MLM PYTHONPATH:** must include `3rdparty/Megatron-LM` so `gpt_builders.py`
  is importable.
- **Scheduler overrides:** when overriding `train.train_iters` to a small value,
  also set `scheduler.lr_warmup_iters` and `scheduler.lr_decay_iters`, or you
  get an assertion error.
- **Use `dataset.sequence_length` in CLI overrides, not `dataset.seq_length`.**
- **MoE OOM:** large MoE models require full activation recomputation and
  typically multi-node expert parallelism. TP does NOT reduce per-GPU expert
  memory.
- **`uv sync --locked` fails after switching to dev:** the lockfile is generated
  against the main MCore commit. Use `uv sync` (without `--locked`) when on dev.
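Some of these pitfalls can be caught before launch. A hypothetical pre-flight check over the CLI overrides (the override key names come from the commands in this document; the `preflight_warnings` helper itself is illustrative, not part of the repo):

```python
def preflight_warnings(overrides: list) -> list:
    """Return warnings for the common override mistakes listed above."""
    keys = {o.split("=", 1)[0] for o in overrides if "=" in o}
    warns = []
    if "dataset.seq_length" in keys:
        # Wrong key name: Bridge expects dataset.sequence_length.
        warns.append("use dataset.sequence_length, not dataset.seq_length")
    if "train.train_iters" in keys and not {
        "scheduler.lr_warmup_iters", "scheduler.lr_decay_iters"
    } <= keys:
        # Overriding train_iters without scheduler iters trips an assertion.
        warns.append("set scheduler.lr_warmup_iters and scheduler.lr_decay_iters "
                     "when overriding train.train_iters")
    return warns
```

Run it on your override list before launching; an empty result means none of the checked pitfalls apply.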