# Build & Dependency Guide
**The core principle:** build and develop inside containers — the CI container ships the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions (TransformerEngine, DeepEP, …) that cannot be reproduced on a bare host.
## Why Containers
Megatron-LM depends on CUDA, NCCL, PyTorch with GPU support, TransformerEngine, and optional components like ModelOpt and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships Dockerfiles that pin every dependency.
Use the container as your development environment. This guarantees:
- Identical CUDA / NCCL / cuDNN versions across all developers and CI.
- `uv.lock` resolves the same way locally and in CI.
- GPU-dependent operations (training, testing) work out of the box.
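A quick way to verify that parity from inside a freshly launched container (a sketch: `nvcc` ships on `PATH` in the NGC base images, and the exact versions depend on the image tag):

```bash
# Inside the container: print the pinned stack versions.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import torch; print('NCCL', torch.cuda.nccl.version())"
nvcc --version | tail -n 1
```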
## dev vs lts
Two image variants exist, controlled by the `IMAGE_TYPE` build arg and the `container::lts` PR label:
| Variant | Base image pin | uv group | When used |
|---|---|---|---|
| `dev` | `docker/.ngc_version.dev` | `dev` | Default — CI, local development, most PRs |
| `lts` | `docker/.ngc_version.lts` | `lts` | Stability testing; excludes ModelOpt and other bleeding-edge extras |
Use `dev` for everything unless you have a specific reason to test `lts`. CI runs `dev` by default; attach `container::lts` to a PR only when verifying compatibility with the stable stack (e.g. a dependency upgrade that must not break LTS users). The `@pytest.mark.flaky_in_dev` marker skips tests in the `dev` environment; `@pytest.mark.flaky` skips them in `lts`.
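To reproduce that filtering locally, the markers can be deselected with pytest's standard `-m` expression (a hypothetical invocation: the test path is illustrative, and CI may wire the skips differently):

```bash
# Run unit tests while deselecting tests marked flaky in the dev image.
uv run pytest -m "not flaky_in_dev" tests/unit_tests
```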
## Step 1 — Acquire an Image
### Option A — NVIDIA-internal: pull a CI-built image
> ⚠️ Requires access to the internal GitLab instance. See @tools/trigger_internal_ci.md for setup (adding the git remote, obtaining a token).
The internal GitLab CI publishes images to its container registry. Derive the registry host from your configured `gitlab` remote — the same host you use for `trigger_internal_ci.py`:
```bash
# Derive host from your 'gitlab' remote:
GITLAB_HOST=$(git remote get-url gitlab | sed 's/.*@\(.*\):.*/\1/')
docker pull ${GITLAB_HOST}/adlr/megatron-lm/mcore_ci_dev:main
```
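The `sed` expression assumes an SSH-style remote of the form `git@<host>:<path>`; you can sanity-check it against a hypothetical URL:

```bash
echo "git@gitlab.example.com:adlr/megatron-lm.git" | sed 's/.*@\(.*\):.*/\1/'
# -> gitlab.example.com
```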
### Option B — Build from scratch (works for everyone)
> ⚠️ `Dockerfile.ci.dev` has two stages: `main` and `jet`. The `jet` stage requires an internal build secret and will fail without it. Always pass `--target main` to stop at the public stage.
```bash
# dev image (default)
docker build \
  --target main \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
  --build-arg IMAGE_TYPE=dev \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local .

# lts image
docker build \
  --target main \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
  --build-arg IMAGE_TYPE=lts \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local-lts .
```
Which image variant CI uses is controlled by the PR label `container::lts`; absent that label, `dev` is used.
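Whichever route you took, a quick smoke test confirms the image is usable before you start developing (a minimal sketch; it assumes only that the image's Python has GPU-enabled PyTorch, which the stack described above provides):

```bash
docker run --rm --gpus all megatron-lm:local \
  python -c "import torch; print(torch.cuda.is_available())"
```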
## Step 2 — Launch the Container
### Option A — Local Docker runtime
```bash
docker run --rm --gpus all \
  -v $(pwd):/workspace \
  -w /workspace \
  megatron-lm:local \
  bash -c "<your command>"
```
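For interactive sessions it helps to persist the uv cache across container restarts (see Common Pitfalls below); a variant of the same launch, assuming `$HOME/.cache/uv` exists on the host:

```bash
docker run --rm --gpus all -it \
  -v $(pwd):/workspace \
  -v $HOME/.cache/uv:/root/.cache/uv \
  -w /workspace \
  megatron-lm:local bash
```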
### Option B — Slurm cluster (for those without a local Docker runtime)
NVIDIA clusters typically use Pyxis + enroot. Request an interactive session:
```bash
srun \
  --nodes=1 --gpus-per-node=8 \
  --container-image megatron-lm:local \
  --container-mounts $(pwd):/workspace \
  --container-workdir /workspace \
  --pty bash
```
For clusters that require a `.sqsh` archive first:
```bash
enroot import -o megatron-lm.sqsh dockerd://megatron-lm:local
srun \
  --nodes=1 --gpus-per-node=8 \
  --container-image $(pwd)/megatron-lm.sqsh \
  --container-mounts $(pwd):/workspace \
  --container-workdir /workspace \
  --pty bash
```
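Once the interactive shell is up, confirm the allocation actually exposes the GPUs you requested:

```bash
# Should print one line per GPU (8 lines for --gpus-per-node=8).
nvidia-smi -L
```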
## Dependency Management
Dependencies are declared in `pyproject.toml`. The venv lives at `/opt/venv` inside the container (already on `PATH`).
> All `uv` operations must be run inside the container. Never run `uv sync` / `uv pip install` on the host.
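A quick sanity check that you are in the right environment (the expected paths follow from the `/opt/venv` layout described above):

```bash
which python    # expected: /opt/venv/bin/python
which uv        # uv ships in the CI image
python -c "import sys; print(sys.prefix)"    # expected: /opt/venv
```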
### uv Dependency Groups
| Group | Purpose |
|---|---|
| `training` | Runtime training extras |
| `dev` | Full dev environment (TransformerEngine, ModelOpt, …) |
| `lts` | LTS-safe subset (no ModelOpt) |
| `test` | pytest, coverage, nemo-run |
| `linting` | ruff, black, isort, pylint |
| `build` | Cython, pybind11, nvidia-mathdx |
Install commands (inside the container):
```bash
# Full dev + test environment
uv sync --locked --group dev --group test

# Linting only
uv sync --locked --only-group linting

# LTS environment
uv sync --locked --group lts --group test
```
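To confirm a sync actually installed the extras you expect, list the environment (a sketch; the grep pattern is illustrative and package-name spellings may differ):

```bash
uv pip list | grep -iE "transformer|modelopt"
```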
Several dependencies are sourced directly from git (TransformerEngine, nemo-run, FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). The locked `uv.lock` file pins exact revisions; update it with `uv lock` when changing `pyproject.toml`.
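To see which revisions are currently pinned, grep the lock file for git sources (a sketch; the exact lockfile syntax can vary across uv versions):

```bash
grep -n 'git = ' uv.lock | head
```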
### Adding a New Dependency
Follow this three-step workflow:
1. **Acquire a container image** — see Step 1 above.

2. **Launch the container interactively** — see Step 2 above.

3. **Update the lock file inside the container, then commit it:**

    ```bash
    # Inside the container:
    uv add <package>    # adds to pyproject.toml and resolves
    uv lock             # regenerates uv.lock

    # Exit the container, then on the host:
    git add pyproject.toml uv.lock
    git commit -S -s -m "build: add <package> dependency"
    ```
### Resolving a merge conflict in `uv.lock`
`uv.lock` is machine-generated; never resolve conflicts manually. Instead:
```bash
git checkout origin/main -- uv.lock   # take main's version as the base
# then inside the container:
uv lock                               # re-resolve on top of your pyproject.toml changes
```
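Before committing the regenerated lock, verify it is self-consistent with `pyproject.toml`; the same `--locked` flag used above fails fast if anything has drifted:

```bash
uv sync --locked --group dev --group test   # succeeds only if uv.lock matches pyproject.toml
```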
## Common Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| `uv sync --locked` fails | Dependency conflict or stale `uv.lock` | Re-run `uv lock` inside the container and commit the updated lock |
| `ModuleNotFoundError` after `pip install` | pip installed outside the uv-managed venv | Use `uv add` and `uv sync`, never bare `pip install` |
| `uv: command not found` inside container | Wrong container image | Use the megatron-lm image built from `Dockerfile.ci.dev` |
| `No space left on device` during uv ops | Cache fills the container's `/root/.cache/` | Mount a host cache dir via `-v $HOME/.cache/uv:/root/.cache/uv` |
| `docker build` fails with secret-related error | `Dockerfile.ci.dev` has a `jet` stage that requires an internal secret | Add `--target main` to stop before the `jet` stage |
| `access forbidden` when pulling | Registry URL includes an explicit port (e.g. `:5005`) | Use `${GITLAB_HOST}/adlr/...` with no port — the `sed` extracts the hostname only |