install-stack
Install Multi-Chip Software Stack
Install 5 packages inside a running GPU container, in dependency order, with per-package validation and structured error reporting.
Skill Components
install-stack/
├── SKILL.md # This file — execution flow
├── scripts/
│ ├── detect_network.py # Probe GitHub/PyPI, return mirror config (JSON)
│ ├── collect_env_info.py # Python/glibc/arch/vendor/disk info (JSON)
│ ├── select_flagtree_wheel.py # Match vendor+python+glibc → wheel specifier (JSON)
│ └── validate_packages.py # Import-test all 5 packages, report status (JSON)
└── references/
├── vendor-mappings.md # FlagCX make flags, adaptor names, dependency chain
└── network-mirrors.md # GitHub/PyPI mirror config rules
Prerequisites
- A running Docker container with PyTorch + GPU access (from
/gpu-container-setup) - Know the container name and GPU vendor
If invoked standalone, ask the user for container name and GPU vendor.
If invoked from /flagrelease orchestrator, these are passed as context.
Execution Flow
Step 0: Resolve Container & Vendor
Verify the container is running:
docker inspect --format='{{.State.Status}}' <CONTAINER> | grep -q running
Copy and run scripts/collect_env_info.py inside the container to get vendor,
Python version, glibc version, architecture, and free disk space:
docker cp <SKILL_DIR>/scripts/collect_env_info.py <CONTAINER>:/tmp/
docker exec <CONTAINER> python3 /tmp/collect_env_info.py
If vendor is unknown and user didn't provide --vendor, ask the user.
Step 1: Detect Network Environment
Copy and run scripts/detect_network.py inside the container:
docker cp <SKILL_DIR>/scripts/detect_network.py <CONTAINER>:/tmp/
docker exec <CONTAINER> python3 /tmp/detect_network.py
Parse the JSON output to get GITHUB_PREFIX and PIP_INDEX for all subsequent
commands. See references/network-mirrors.md for fallback rules.
Step 2: Check Disk Space
From collect_env_info.py output, verify at least 10GB free. If not, warn user
and ask whether to proceed.
Step 3: Install Packages (in dependency order)
See references/vendor-mappings.md for dependency chain and install order:
vLLM → FlagTree → FlagGems → FlagCX → vllm-plugin-FL
3.1: vLLM 0.13.0
docker exec <CONTAINER> pip install ${PIP_INDEX} vllm==0.13.0
Quick validate:
docker exec <CONTAINER> python3 -c "import vllm; assert vllm.__version__ == '0.13.0'"
GATE: If vLLM install fails → record error and EXIT the skill.
3.2: FlagTree (pre-compiled wheel)
Run scripts/select_flagtree_wheel.py to find the correct wheel:
python3 <SKILL_DIR>/scripts/select_flagtree_wheel.py \
--vendor <VENDOR> --python <PY_VER> --glibc <GLIBC_VER>
If status is FOUND, uninstall stock triton and install the wheel:
docker exec <CONTAINER> bash -c '
python3 -m pip uninstall -y triton
python3 -m pip uninstall -y triton
python3 -m pip install <SPECIFIER> <PIP_ARGS>
'
If status is NOT_FOUND, record the mismatch and continue (do not exit).
3.3: FlagGems
docker exec <CONTAINER> bash -c "
cd /tmp && git clone ${GITHUB_PREFIX}/FlagOpen/FlagGems
cd FlagGems && pip install ${PIP_INDEX} -e .
"
Failure → record and continue.
3.4: FlagCX (two-phase build)
Read references/vendor-mappings.md to look up the correct Make flag and
FLAGCX_ADAPTOR for the detected vendor.
Phase 1: Build C++ library:
docker exec <CONTAINER> bash -c "
cd /tmp && git clone ${GITHUB_PREFIX}/flagos-ai/FlagCX
cd FlagCX && git submodule update --init --recursive
make <MAKE_FLAG> -j\$(nproc)
"
Phase 2: Install PyTorch plugin:
docker exec <CONTAINER> bash -c "
cd /tmp/FlagCX/plugin/torch
FLAGCX_ADAPTOR=<ADAPTOR> pip install -e . --no-build-isolation
"
Failure → record and continue.
3.5: vllm-plugin-FL
docker exec <CONTAINER> bash -c "
cd /tmp && git clone ${GITHUB_PREFIX}/flagos-ai/vllm-plugin-FL
cd vllm-plugin-FL
pip install ${PIP_INDEX} -r requirements.txt
pip install --no-build-isolation -e .
"
On Iluvatar, if requirements.txt fails, retry with requirements_iluvatar.txt.
GATE: If vllm-plugin-FL fails → record error and EXIT the skill.
Step 4: Validate All Packages
Copy and run scripts/validate_packages.py inside the container:
docker cp <SKILL_DIR>/scripts/validate_packages.py <CONTAINER>:/tmp/
docker exec <CONTAINER> python3 /tmp/validate_packages.py
This produces a comprehensive JSON report of all 5 packages with import status, versions, and gate check.
Step 5: Set Runtime Environment
If FlagCX installed successfully, persist FLAGCX_PATH:
docker exec <CONTAINER> bash -c "echo 'export FLAGCX_PATH=/tmp/FlagCX' >> ~/.bashrc"
Step 6: Produce Final Report
Combine all results into structured output:
{
"status": "PASS | PARTIAL | FAIL",
"stage": "install-stack",
"container": "<name>",
"vendor": "<vendor>",
"network": {"github_mirror": true, "pypi_mirror": true},
"python_version": "3.11",
"glibc_version": "2.34",
"packages": {
"vllm": {"status": "PASS", "version": "0.13.0"},
"flagtree": {"status": "PASS", "version": "0.4.1+ascend3.2"},
"flaggems": {"status": "PASS", "version": "..."},
"flagcx": {"status": "PASS", "version": "0.10.0"},
"vllm_plugin_fl": {"status": "PASS", "version": "..."}
},
"flagcx_path": "/tmp/FlagCX",
"gate_passed": true,
"errors": []
}
Status logic:
PASS— all 5 packages installed and validatedPARTIAL— vLLM + plugin installed, some of FlagTree/FlagGems/FlagCX failedFAIL— vLLM or plugin failed (gate failed)
Error Handling
| Failure | Behavior |
|---|---|
| Container not running | Report error, exit |
| Network unreachable (both direct and mirror) | Report, exit |
| pip install fails | Report package, full error, continue (unless gate) |
| Build from source fails | Report compiler error, continue |
| No matching FlagTree wheel | Report vendor/python/glibc mismatch, continue |
| Import fails after install | Report traceback, continue |
| Disk space < 10GB | Warn user, ask whether to proceed |
| Timeout (any command) | All commands use timeout; report which step |
Timeout Rules
| Operation | Timeout |
|---|---|
| pip install (per package) | 300s |
| git clone | 120s |
| make (FlagCX) | 300s |
| Network probe | 5s |
| Validation (import) | 30s |