adding-model-support
Adding New Model Support in Megatron-Bridge
Phase 1: Discovery
Step 1 — Get the HF model link
Ask the user for the HuggingFace model link (e.g. https://huggingface.co/Qwen/Qwen3.5-VL-27B).
If the model is not public, ask the user to provide the config.json file directly.
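As a rough illustration of this step, the sketch below turns a model link into a repo id and the URL where its config.json would normally live. The helper names `repo_id_from_link` and `config_url` are hypothetical and not part of Megatron-Bridge, and the `resolve/<revision>/config.json` path assumes the usual Hub file layout with an `org/name` style repo id.

```python
from urllib.parse import urlparse


def repo_id_from_link(link: str) -> str:
    """Extract 'org/name' from a HuggingFace model link (hypothetical helper)."""
    parsed = urlparse(link.strip())
    if parsed.netloc not in ("huggingface.co", "www.huggingface.co"):
        raise ValueError(f"not a HuggingFace link: {link!r}")
    # Drop empty segments so trailing slashes and suffixes like /tree/main are tolerated.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected an 'org/name' path in: {link!r}")
    return "/".join(parts[:2])


def config_url(repo_id: str, revision: str = "main") -> str:
    """Build the URL of config.json, assuming the usual Hub file layout."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/config.json"


if __name__ == "__main__":
    repo = repo_id_from_link("https://huggingface.co/Qwen/Qwen3.5-VL-27B")
    print(repo)
    print(config_url(repo))
```

If the URL cannot be fetched without credentials, that is the cue to fall back to asking the user for the config.json file.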
Step 2 — Fetch and analyze config.json
Read the model's config.json from HuggingFace (or from the user-provided file). Key fields to extract:
- model_type: used for @register_bridge(model_type=...)
- architectures: the HF model class name (used for source=... in registration)
- tie_word_embeddings: critical for weight tying
- Architecture fields: num_hidden_layers, hidden_size, intermediate_size, num_attention_heads, num_key_value_heads, vocab_size, max_position_embeddings, rope_theta, etc.
- MoE fields (if present): num_local_experts, num_experts_per_tok, moe_intermediate_size
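The field extraction above can be sketched as a small helper that reads an already-parsed config.json and groups the values by purpose. This is a minimal sketch, not Megatron-Bridge code: `summarize_config` and `load_config` are hypothetical names, and the fallback to a nested `text_config` block is an assumption that applies only to configs (often vision-language ones) that nest the language-model settings there.

```python
import json
from pathlib import Path
from typing import Any

ARCH_FIELDS = (
    "num_hidden_layers", "hidden_size", "intermediate_size",
    "num_attention_heads", "num_key_value_heads", "vocab_size",
    "max_position_embeddings", "rope_theta",
)
MOE_FIELDS = ("num_local_experts", "num_experts_per_tok", "moe_intermediate_size")


def load_config(path: str) -> dict[str, Any]:
    """Load a config.json from disk, e.g. a file supplied by the user."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_config(config: dict[str, Any]) -> dict[str, Any]:
    """Collect the fields needed for bridge registration (hypothetical helper)."""
    # Assumption: some configs nest language-model settings under "text_config".
    text = config.get("text_config") or {}

    def pick(key: str) -> Any:
        # Prefer the top-level value, then the nested one; None means "not found".
        return config.get(key, text.get(key))

    # Only MoE keys that are actually present are kept.
    moe = {k: pick(k) for k in MOE_FIELDS if pick(k) is not None}
    return {
        "model_type": config.get("model_type"),
        "architectures": config.get("architectures") or [],
        "tie_word_embeddings": pick("tie_word_embeddings"),
        "architecture": {k: pick(k) for k in ARCH_FIELDS},
        "moe": moe,
        "is_moe": bool(moe),
    }


if __name__ == "__main__":
    sample = {"model_type": "llama", "architectures": ["LlamaForCausalLM"],
              "tie_word_embeddings": False, "num_hidden_layers": 2}
    print(json.dumps(summarize_config(sample), indent=2))
```

Missing keys are reported as None rather than filled with a guessed default, so gaps in the config stay visible during discovery.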