Resiliency
Stable docs: docs/training/resiliency.md, docs/training/checkpointing.md
Card: card.yaml (co-located)
Enablement
Fault tolerance (Slurm only)
Option 1: NeMo Run plugin (recommended)
from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run
task = run.Script(...)
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
        calc_ft_timeouts=True,
        num_in_job_restarts=3,
        num_job_retries_on_failure=2,
        initial_rank_heartbeat_timeout=1800,
        rank_heartbeat_timeout=300,
    )
]
run.run(task, plugins=run_plugins, executor=executor)
| Plugin parameter | Default | Description |
|---|---|---|
| `num_in_job_restarts` | 3 | Max restarts within same job |
| `num_job_retries_on_failure` | 2 | Max new job launches on failure |
| `initial_rank_heartbeat_timeout` | 1800 | First heartbeat timeout (seconds) |
| `rank_heartbeat_timeout` | 300 | Subsequent heartbeat timeout (seconds) |
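The two heartbeat timeouts in the table can be read as a longer allowance for the first heartbeat and a shorter one afterwards. The following is a minimal sketch of that idea only; `HeartbeatWatch` is a hypothetical helper written for illustration and is not the fault tolerance package's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatWatch:
    """Hypothetical helper: checks one rank's heartbeats against two timeouts."""

    initial_timeout: float = 1800.0    # mirrors initial_rank_heartbeat_timeout
    subsequent_timeout: float = 300.0  # mirrors rank_heartbeat_timeout
    started_at: float = 0.0            # time the rank was launched (seconds)
    last_beat: Optional[float] = None  # time of the most recent heartbeat

    def beat(self, now: float) -> None:
        """Record a heartbeat received at time `now`."""
        self.last_beat = now

    def is_hung(self, now: float) -> bool:
        """Apply the initial timeout until the first heartbeat, then the shorter one."""
        if self.last_beat is None:
            return now - self.started_at > self.initial_timeout
        return now - self.last_beat > self.subsequent_timeout


watch = HeartbeatWatch()
print(watch.is_hung(now=1000.0))  # False: still inside the initial window
watch.beat(now=1200.0)
print(watch.is_hung(now=1600.0))  # True: 400 s since the last heartbeat
```

The longer initial window leaves room for slow startup work before the first heartbeat arrives.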
Option 2: Direct config + ft_launcher
from megatron.bridge.training.config import FaultToleranceConfig
cfg.ft = FaultToleranceConfig(
    enable_ft_package=True,
    calc_ft_timeouts=True,
    simulate_fault=False,
    simulated_fault_type="random",
)
Launch with ft_launcher (not torchrun):
export GROUP_RANK=0 # required for non-Slurm
ft_launcher \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \
    --ft-rank_out_of_section_timeout=300 \
    your_training_script.py
| Config parameter | Default | Description |
|---|---|---|
| `enable_ft_package` | False | Enable fault tolerance |
| `calc_ft_timeouts` | False | Auto-compute optimal timeouts |
| `simulate_fault` | False | Enable fault simulation for testing |
| `simulated_fault_type` | `"random"` | `"rank_hung"`, `"rank_killed"`, or `"random"` |
| `simulated_fault_rank` | None | Specific rank to fault (random if None) |
| `simulated_fault_base_delay` | 0 | Base delay before simulating fault |
Section-based timeout monitoring covers setup, training steps, checkpointing,
and out-of-section time independently. Timeouts are saved to ft_state.json
for subsequent runs when calc_ft_timeouts=True.
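To make the section-based model concrete, here is a small sketch that parses a timeout spec in the same `name:seconds` form used by `--ft-rank_section_timeouts` and picks the limit that applies to a given section. The helper names `parse_section_timeouts` and `timed_out` are hypothetical and written for illustration; they are not part of `ft_launcher`.

```python
from typing import Dict, Optional


def parse_section_timeouts(spec: str) -> Dict[str, float]:
    """Parse 'setup:600,step:180' into {'setup': 600.0, 'step': 180.0}."""
    timeouts: Dict[str, float] = {}
    for item in spec.split(","):
        name, _, seconds = item.partition(":")
        if not name.strip() or not seconds.strip():
            raise ValueError(f"malformed section timeout: {item!r}")
        timeouts[name.strip()] = float(seconds)
    return timeouts


def timed_out(
    section: Optional[str],
    elapsed: float,
    section_timeouts: Dict[str, float],
    out_of_section_timeout: float,
) -> bool:
    """Return True when `elapsed` seconds exceeds the limit that applies.

    `section=None` models time spent outside any named section.
    """
    if section is None:
        limit = out_of_section_timeout
    else:
        limit = section_timeouts[section]
    return elapsed > limit


timeouts = parse_section_timeouts("setup:600,step:180,checkpointing:420")
print(timed_out("step", 200.0, timeouts, out_of_section_timeout=300.0))
print(timed_out("checkpointing", 200.0, timeouts, out_of_section_timeout=300.0))
```

The same 200 seconds is over the limit for a training step but within the limit for checkpointing, which is why each section is monitored independently.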
NVRx straggler detection
from megatron.bridge.training.config import NVRxStragglerDetectionConfig
cfg.nvrx_straggler = NVRxStragglerDetectionConfig(
    enabled=True,
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=5,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_logging=True,
)
| Parameter | Default | Description |
|---|---|---|
| `enabled` | False | Enable straggler detection |
| `report_time_interval` | 300.0 | Seconds between straggler checks |
| `calc_relative_gpu_perf` | True | Compare ranks against each other |
| `calc_individual_gpu_perf` | True | Track per-rank degradation over time |
| `gpu_relative_perf_threshold` | 0.7 | Threshold for relative performance (0-1) |
| `gpu_individual_perf_threshold` | 0.7 | Threshold for individual performance (0-1) |
| `stop_if_detected` | False | Terminate training on straggler |
| `num_gpu_perf_scores_to_print` | 5 | Number of best/worst scores to print |
| `profiling_interval` | 1 | Profiling interval for detector |
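One way to read a relative threshold in the 0-1 range is as a fraction of the fastest rank's measured performance. The sketch below illustrates that interpretation with made-up throughput numbers. It is an assumption for explanation only and does not reproduce how NVRx computes its scores; `relative_scores` and `find_stragglers` are hypothetical helpers.

```python
from typing import Dict, List


def relative_scores(throughput_by_rank: Dict[int, float]) -> Dict[int, float]:
    """Scale each rank's throughput by the best rank, giving scores in (0, 1]."""
    best = max(throughput_by_rank.values())
    if best <= 0:
        raise ValueError("throughput values must be positive")
    return {rank: value / best for rank, value in throughput_by_rank.items()}


def find_stragglers(
    throughput_by_rank: Dict[int, float], threshold: float = 0.7
) -> List[int]:
    """Return ranks whose relative score falls below `threshold`."""
    scores = relative_scores(throughput_by_rank)
    return sorted(rank for rank, score in scores.items() if score < threshold)


# Illustrative numbers only: rank 2 runs at 65% of the fastest rank.
throughput = {0: 100.0, 1: 98.0, 2: 65.0, 3: 71.0}
print(find_stragglers(throughput))  # [2]
```

Under this reading, a threshold of 0.7 tolerates ranks up to 30% slower than the fastest one before reporting them.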
Preemption
Plugin (Slurm)
from megatron.bridge.recipes.run_plugins import PreemptionPlugin
plugins = [
    PreemptionPlugin(
        preempt_time=60,
        enable_exit_handler=True,
        enable_exit_handler_for_data_loader=False,
    )
]
| Plugin parameter | Default | Description |
|---|---|---|
| `preempt_time` | 60 | Seconds before job limit to send signal |
| `enable_exit_handler` | True | Enable signal handler in training |
| `enable_exit_handler_for_data_loader` | False | Enable for dataloader workers |
Direct config
import signal
cfg.train.exit_signal_handler = True
cfg.train.exit_signal = signal.SIGTERM
cfg.train.exit_signal_handler_for_dataloader = False
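The general pattern behind an exit signal handler is to record the signal in a flag and let the training loop stop at a safe point. The following standard-library sketch shows that pattern. `ExitSignalFlag` and `train` are hypothetical names, and this is not the handler in `sig_utils.py`.

```python
import signal


class ExitSignalFlag:
    """Hypothetical helper: remembers that an exit signal was delivered."""

    def __init__(self, sig: int = signal.SIGTERM) -> None:
        self.sig = sig
        self.received = False

    def _handle(self, signum, frame) -> None:
        # Only set a flag; the loop decides when it is safe to stop.
        self.received = True

    def install(self) -> None:
        """Register the handler (must be called from the main thread)."""
        signal.signal(self.sig, self._handle)


def train(num_steps: int, flag: ExitSignalFlag) -> int:
    """Run up to `num_steps` steps, stopping early once the flag is set."""
    completed = 0
    for _ in range(num_steps):
        if flag.received:
            break  # a real loop would save a checkpoint before exiting
        completed += 1  # placeholder for one training step
    return completed
```

Checking the flag between steps, rather than exiting inside the handler, keeps the process from stopping in the middle of an iteration.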
Re-run state machine (experimental)
from megatron.bridge.training.config import RerunStateMachineConfig
cfg.rerun_state_machine = RerunStateMachineConfig(
    rerun_mode="validate_results",
    check_for_nan_in_loss=True,
    check_for_spiky_loss=False,
    spiky_loss_factor=10.0,
)
| Parameter | Default | Description |
|---|---|---|
| `rerun_mode` | `"disabled"` | `"disabled"`, `"validate_results"`, `"report_determinism_stats"` |
| `check_for_nan_in_loss` | True | Check for NaN in loss |
| `check_for_spiky_loss` | False | Check for unexpectedly large loss |
| `spiky_loss_factor` | 10.0 | Loss flagged if > factor * max observed (increase for large models) |
Exit codes: 16 = resume to disambiguate, 17 = failed validation.
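The spiky-loss rule in the table (flag a loss greater than the factor times the maximum observed so far) can be sketched in a few lines. `LossChecker` is a hypothetical class written for illustration, not the re-run state machine itself. Leaving flagged values out of the running maximum is a choice made for this sketch.

```python
import math
from typing import Optional


class LossChecker:
    """Hypothetical helper mirroring the NaN and spiky-loss checks."""

    def __init__(
        self,
        spiky_loss_factor: float = 10.0,
        check_for_nan: bool = True,
        check_for_spiky: bool = False,
    ) -> None:
        self.spiky_loss_factor = spiky_loss_factor
        self.check_for_nan = check_for_nan
        self.check_for_spiky = check_for_spiky
        self.max_observed: Optional[float] = None

    def check(self, loss: float) -> Optional[str]:
        """Return 'nan' or 'spiky' when the loss is flagged, else None."""
        if self.check_for_nan and math.isnan(loss):
            return "nan"
        if (
            self.check_for_spiky
            and self.max_observed is not None
            and loss > self.spiky_loss_factor * self.max_observed
        ):
            return "spiky"
        # Only accepted losses update the running maximum (sketch assumption).
        if self.max_observed is None or loss > self.max_observed:
            self.max_observed = loss
        return None


checker = LossChecker(check_for_spiky=True)
print(checker.check(2.0))   # None: first value becomes the running maximum
print(checker.check(25.0))  # spiky: 25.0 > 10.0 * 2.0
```

A larger `spiky_loss_factor` widens the band of accepted losses, which matches the table's advice to increase it for large models.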
In-process restart (experimental)
from megatron.bridge.training.config import InProcessRestartConfig
cfg.inprocess_restart = InProcessRestartConfig(
    enabled=True,
    granularity="node",
    soft_timeout=60.0,
    hard_timeout=90.0,
)
| Parameter | Default | Description |
|---|---|---|
| `enabled` | False | Enable in-process restart |
| `active_world_size` | None | Ranks executing workload (rest are warm reserves) |
| `granularity` | `"node"` | `"node"` or `"rank"` restart granularity |
| `max_iterations` | None | Max restart attempts (None = unlimited) |
| `soft_timeout` | 60.0 | Detect GIL-released hangs (seconds) |
| `hard_timeout` | 90.0 | Force-terminate hung ranks (seconds) |
| `heartbeat_interval` | 30.0 | Heartbeat interval (seconds) |
| `heartbeat_timeout` | 60.0 | Missing heartbeat timeout (seconds) |
| `barrier_timeout` | 120.0 | Distributed barrier timeout (seconds) |
| `completion_timeout` | 120.0 | Completion barrier timeout (seconds) |
| `empty_cuda_cache` | True | Clear CUDA cache during restart |
| `max_rank_faults` | None | Max rank faults before terminating |
| `monitor_process_logdir` | None | Directory for monitor logs |
Required environment variables:
export TORCH_CPP_LOG_LEVEL=error
export TORCH_NCCL_RETHROW_CUDA_ERRORS=0
export NCCL_NVLS_ENABLE=0
The PyTorch NCCL watchdog timeout must exceed hard_timeout. NeMo-Run's
Slurm Executor is not supported; launch directly with srun --kill-on-bad-exit=0.
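The requirements above (three environment variables and a watchdog timeout larger than `hard_timeout`) lend themselves to a preflight check before launch. The sketch below is one possible shape for it. `ipr_preflight` is a hypothetical helper, and the NCCL watchdog timeout is passed in as a plain number because the source does not say how it is configured.

```python
from typing import List, Mapping

# Values taken from the required environment variables listed above.
REQUIRED_ENV = {
    "TORCH_CPP_LOG_LEVEL": "error",
    "TORCH_NCCL_RETHROW_CUDA_ERRORS": "0",
    "NCCL_NVLS_ENABLE": "0",
}


def ipr_preflight(
    env: Mapping[str, str],
    hard_timeout: float,
    nccl_watchdog_timeout: float,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    for name, expected in REQUIRED_ENV.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name} should be {expected!r}, found {actual!r}")
    if nccl_watchdog_timeout <= hard_timeout:
        problems.append(
            f"NCCL watchdog timeout ({nccl_watchdog_timeout}s) must exceed "
            f"hard_timeout ({hard_timeout}s)"
        )
    return problems
```

In a launch script this could be called with `os.environ` and the configured `hard_timeout`, aborting the launch if the returned list is not empty.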
Async checkpoint save
cfg.checkpoint.async_save = True
cfg.checkpoint.ckpt_format = "torch_dist"
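Because async save depends on the `torch_dist` format, a guard that fails loudly can catch a mismatched pair of settings early. This is a minimal sketch with a hypothetical function name, `validate_async_save`, and is not part of the library's own validation.

```python
def validate_async_save(async_save: bool, ckpt_format: str) -> None:
    """Raise ValueError when async save is requested with an unsupported format."""
    if async_save and ckpt_format != "torch_dist":
        raise ValueError(
            f"async_save=True requires ckpt_format='torch_dist', got {ckpt_format!r}"
        )


# Passes silently: the combination shown in the config above.
validate_async_save(async_save=True, ckpt_format="torch_dist")
```

Raising an exception here turns a misconfiguration into an immediate, visible error at startup.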
Local checkpointing (NVRx)
cfg.checkpoint.non_persistent_local_ckpt_dir = "/local/scratch/ckpt"
cfg.checkpoint.non_persistent_local_ckpt_algo = "fully_parallel"
Code Anchors
Fault tolerance
- Config: `src/megatron/bridge/training/config.py` (`FaultToleranceConfig`)
- Runtime: `src/megatron/bridge/training/fault_tolerance.py`
- Plugin: `src/megatron/bridge/recipes/run_plugins.py` (`FaultTolerancePlugin`)
- Perf plugin: `scripts/performance/resiliency_plugins.py`
- Tests: `tests/unit_tests/training/test_fault_tolerance.py`
- Example: `examples/resiliency/fault_tolerance/`
Straggler detection
- Config: `src/megatron/bridge/training/config.py` (`NVRxStragglerDetectionConfig`)
- Runtime: `src/megatron/bridge/training/nvrx_straggler.py`
- Train loop: `src/megatron/bridge/training/train.py` (`check_nvrx_straggler_detection`)
- Tests: `tests/unit_tests/training/test_nvrx_straggler.py`, `tests/functional_tests/training/test_nvrx_straggler.py`
- Example: `examples/resiliency/straggler_detection/`
In-process restart
- Config: `src/megatron/bridge/training/config.py` (`InProcessRestartConfig`)
- Runtime: `src/megatron/bridge/training/inprocess_restart.py`
- Entry point: `src/megatron/bridge/training/pretrain.py` (`maybe_wrap_for_inprocess_restart`)
- Tests: `tests/unit_tests/training/test_inprocess_restart.py`, `tests/functional_tests/training/test_inprocess_restart.py`
Preemption
- Plugin: `src/megatron/bridge/recipes/run_plugins.py` (`PreemptionPlugin`)
- Signal handler: `src/megatron/bridge/training/utils/sig_utils.py`
- Tests: `tests/unit_tests/recipes/test_run_plugins.py`
Re-run state machine
- Config: `src/megatron/bridge/training/config.py` (`RerunStateMachineConfig`)
- Init: `src/megatron/bridge/training/initialize.py` (`init_rerun_state`)
Checkpointing
- Async save: `src/megatron/bridge/training/checkpointing.py` (`schedule_async_save`)
- Local ckpt: `src/megatron/bridge/training/checkpointing.py` (`LocalCheckpointManager`)
- Tests: `tests/functional_tests/training/test_local_checkpointing.py`
Pitfalls
- ft_launcher, not torchrun: Direct `FaultToleranceConfig` requires `ft_launcher`. Using `torchrun` silently disables FT. For non-Slurm, set `GROUP_RANK=0`.
- Async save requires torch_dist: `async_save=True` only works with `ckpt_format="torch_dist"`. Other formats silently fail or error.
- IPR + NeMo-Run: In-process restart is not compatible with NeMo-Run or Slurm preemption plugins. Requires specific PyTorch/NCCL versions and env vars.
- NVRx vs legacy straggler: Two detectors exist. Use NVRx (`nvrx_straggler`); do not enable both.
- stop_if_detected default: NVRx logs but does not stop training by default. Set `stop_if_detected=True` for automatic termination.
- NCCL watchdog vs hard_timeout: For IPR, NCCL watchdog timeout must exceed `hard_timeout` or PyTorch kills the process before recovery.
- Rerun state machine is alpha: Use `check_for_nan_in_loss=True` for NaN detection, but don't rely on full rerun workflows yet.
Verification
Fault tolerance
./examples/resiliency/fault_tolerance/run_fault_tolerance.sh
./examples/resiliency/fault_tolerance/run_fault_tolerance.sh --simulate-fault
Look for [FaultTolerance] / [RankMonitorServer] log lines that report section
timeouts. With --simulate-fault, the simulated fault should trigger a restart from checkpoint.
Straggler detection
uv run python -m torch.distributed.run --nproc_per_node=2 \
    examples/resiliency/straggler_detection/straggler_detection_example.py
Look for GPU relative performance and GPU individual performance reports
with per-rank scores.
Async checkpoint
Look for Scheduling async checkpoint save in logs. Training iterations
should continue while checkpoint files are being written.
In-process restart
pytest tests/functional_tests/training/test_inprocess_restart.py -v
Requires compatible PyTorch/NCCL versions.