# Build & Dependency Guide
**The core principle:** build and develop inside containers — the CI container ships the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions (TransformerEngine, DeepEP, …) that cannot be reproduced on a bare host.
## Why Containers
Megatron-LM depends on CUDA, NCCL, PyTorch with GPU support, TransformerEngine, and optional components like ModelOpt and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships Dockerfiles that pin every dependency.
Use the container as your development environment. This guarantees:
- Identical CUDA / NCCL / cuDNN versions across all developers and CI.
- `uv.lock` resolves the same way locally and in CI.
- GPU-dependent operations (training, testing) work out of the box.
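A quick way to verify that parity from inside a freshly launched container (a sketch: `nvcc` ships on `PATH` in the NGC base images, and the exact versions depend on the image tag):

```bash
# Inside the container: print the pinned stack versions.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import torch; print('NCCL', torch.cuda.nccl.version())"
nvcc --version | tail -n 1
```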
## dev vs lts
Two image variants exist, controlled by the `IMAGE_TYPE` build arg and the `container::lts` PR label:
| Variant | Base image pin | uv group | When used |
|---|---|---|---|
| `dev` | `docker/.ngc_version.dev` | `dev` | Default — CI, local development, most PRs |
| `lts` | `docker/.ngc_version.lts` | `lts` | Stability testing; excludes ModelOpt and other bleeding-edge extras |
Use `dev` for everything unless you have a specific reason to test `lts`. CI runs `dev` by default; attach `container::lts` to a PR only when verifying compatibility with the stable stack (e.g. a dependency upgrade that must not break LTS users). The `@pytest.mark.flaky_in_dev` marker skips tests in the `dev` environment; `@pytest.mark.flaky` skips them in `lts`.
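To reproduce that filtering locally, the markers can be deselected with pytest's standard `-m` expression (a hypothetical invocation: the test path is illustrative, and CI may wire the skips differently):

```bash
# Run unit tests while deselecting tests marked flaky in the dev image.
uv run pytest -m "not flaky_in_dev" tests/unit_tests
```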
## Step 1 — Acquire an Image
### Option A — NVIDIA-internal: pull a CI-built image
> ⚠️ Requires access to the internal GitLab instance. See @tools/trigger_internal_ci.md for setup (adding the git remote, obtaining a token).
The internal GitLab CI publishes images to its container registry. Derive the registry host from your configured `gitlab` remote — the same host you use for `trigger_internal_ci.py`:
```bash
# Derive host from your 'gitlab' remote:
GITLAB_HOST=$(git remote get-url gitlab | sed 's/.*@\(.*\):.*/\1/')
docker pull ${GITLAB_HOST}/adlr/megatron-lm/mcore_ci_dev:main
```
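The `sed` expression assumes an SSH-style remote of the form `git@<host>:<path>`; you can sanity-check it against a hypothetical URL:

```bash
echo "git@gitlab.example.com:adlr/megatron-lm.git" | sed 's/.*@\(.*\):.*/\1/'
# -> gitlab.example.com
```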
### Option B — Build from scratch (works for everyone)
> ⚠️ `Dockerfile.ci.dev` has two stages: `main` and `jet`. The `jet` stage requires an internal build secret and will fail without it. Always pass `--target main` to stop at the public stage.
```bash
# dev image (default)
docker build \
  --target main \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
  --build-arg IMAGE_TYPE=dev \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local .

# lts image
docker build \
  --target main \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
  --build-arg IMAGE_TYPE=lts \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local-lts .
```
Which image variant CI uses is controlled by the PR label `container::lts`; absent that label, `dev` is used.
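Whichever route you took, a quick smoke test confirms the image is usable before you start developing (a minimal sketch; it assumes only that the image's Python has GPU-enabled PyTorch, which the stack described above provides):

```bash
docker run --rm --gpus all megatron-lm:local \
  python -c "import torch; print(torch.cuda.is_available())"
```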
## Step 2 — Launch the Container
### Option A — Local Docker runtime
```bash
docker run --rm --gpus all \
  -v $(pwd):/workspace \
  -w /workspace \
  megatron-lm:local \
  bash -c "<your command>"
```
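For interactive sessions it helps to persist the uv cache across container restarts (see Common Pitfalls below); a variant of the same launch, assuming `$HOME/.cache/uv` exists on the host:

```bash
docker run --rm --gpus all -it \
  -v $(pwd):/workspace \
  -v $HOME/.cache/uv:/root/.cache/uv \
  -w /workspace \
  megatron-lm:local bash
```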
### Option B — Slurm cluster (for those without a local Docker runtime)
NVIDIA clusters typically use Pyxis + enroot. Request an interactive session:
```bash
srun \
  --nodes=1 --gpus-per-node=8 \
  --container-image megatron-lm:local \
  --container-mounts $(pwd):/workspace \
  --container-workdir /workspace \
  --pty bash
```
For clusters that require a `.sqsh` archive first:
```bash
enroot import -o megatron-lm.sqsh dockerd://megatron-lm:local
srun \
  --nodes=1 --gpus-per-node=8 \
  --container-image $(pwd)/megatron-lm.sqsh \
  --container-mounts $(pwd):/workspace \
  --container-workdir /workspace \
  --pty bash
```
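Once the interactive shell is up, confirm the allocation actually exposes the GPUs you requested:

```bash
# Should print one line per GPU (8 lines for --gpus-per-node=8).
nvidia-smi -L
```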
## Dependency Management
Dependencies are declared in `pyproject.toml`. The venv lives at `/opt/venv` inside the container (already on `PATH`).
> All `uv` operations must be run inside the container. Never run `uv sync` / `uv pip install` on the host.
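A quick sanity check that you are in the right environment (the expected paths follow from the `/opt/venv` layout described above):

```bash
which python    # expected: /opt/venv/bin/python
which uv        # uv ships in the CI image
python -c "import sys; print(sys.prefix)"    # expected: /opt/venv
```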
### uv Dependency Groups
| Group | Purpose |
|---|---|
| `training` | Runtime training extras |
| `dev` | Full dev environment (TransformerEngine, ModelOpt, …) |
| `lts` | LTS-safe subset (no ModelOpt) |
| `test` | pytest, coverage, nemo-run |
| `linting` | ruff, black, isort, pylint |
| `build` | Cython, pybind11, nvidia-mathdx |
Install commands (inside the container):
```bash
# Full dev + test environment
uv sync --locked --group dev --group test

# Linting only
uv sync --locked --only-group linting

# LTS environment
uv sync --locked --group lts --group test
```
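To confirm a sync actually installed the extras you expect, list the environment (a sketch; the grep pattern is illustrative and package-name spellings may differ):

```bash
uv pip list | grep -iE "transformer|modelopt"
```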
Several dependencies are sourced directly from git (TransformerEngine, nemo-run, FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). The locked `uv.lock` file pins exact revisions; update it with `uv lock` when changing `pyproject.toml`.
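To see which revisions are currently pinned, grep the lock file for git sources (a sketch; the exact lockfile syntax can vary across uv versions):

```bash
grep -n 'git = ' uv.lock | head
```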
### Adding a New Dependency
Follow this three-step workflow:
1. **Acquire a container image** — see Step 1 above.

2. **Launch the container interactively** — see Step 2 above.

3. **Update the lock file inside the container, then commit it:**

    ```bash
    # Inside the container:
    uv add <package>    # adds to pyproject.toml and resolves
    uv lock             # regenerates uv.lock

    # Exit the container, then on the host:
    git add pyproject.toml uv.lock
    git commit -S -s -m "build: add <package> dependency"
    ```
### Resolving a merge conflict in `uv.lock`
`uv.lock` is machine-generated; never resolve conflicts manually. Instead:
```bash
git checkout origin/main -- uv.lock   # take main's version as the base
# then inside the container:
uv lock                               # re-resolve on top of your pyproject.toml changes
```
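Before committing the regenerated lock, verify it is self-consistent with `pyproject.toml`; the same `--locked` flag used above fails fast if anything has drifted:

```bash
uv sync --locked --group dev --group test   # succeeds only if uv.lock matches pyproject.toml
```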
## Common Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| `uv sync --locked` fails | Dependency conflict or stale `uv.lock` | Re-run `uv lock` inside the container and commit the updated lock |
| `ModuleNotFoundError` after `pip install` | pip installed outside the uv-managed venv | Use `uv add` and `uv sync`, never bare `pip install` |
| `uv: command not found` inside container | Wrong container image | Use the megatron-lm image built from `Dockerfile.ci.dev` |
| `No space left on device` during uv ops | Cache fills the container's `/root/.cache/` | Mount a host cache dir via `-v $HOME/.cache/uv:/root/.cache/uv` |
| `docker build` fails with secret-related error | `Dockerfile.ci.dev` has a `jet` stage that requires an internal secret | Add `--target main` to stop before the `jet` stage |
| `access forbidden` when pulling | Registry URL includes an explicit port (e.g. `:5005`) | Use `${GITLAB_HOST}/adlr/...` with no port — the `sed` extracts the hostname only |