flagrelease
FlagRelease Pipeline Orchestrator
End-to-end LLM deployment + testing pipeline for multi-chip GPU backends. Orchestrates 4 sub-skills in sequence and produces a final report.
Skill Components
flagrelease/
├── SKILL.md # This file — orchestration flow
└── references/
└── pipeline-state.md # Pipeline state schema, gate logic, data flow
Sub-skills (each independently invokable):
../install-stack/ # Step 2: Install 5 packages
│ ├── SKILL.md
│ ├── scripts/
│ │ ├── detect_network.py # Probe GitHub/PyPI, return mirror config
│ │ ├── collect_env_info.py # Python/glibc/arch/vendor/disk info
│ │ ├── select_flagtree_wheel.py # Match vendor+python+glibc → wheel
│ │ └── validate_packages.py # Import-test all 5 packages
│ └── references/
│ ├── vendor-mappings.md # FlagCX make flags, adaptor names
│ └── network-mirrors.md # Mirror config rules
../env-verify/ # Step 3: Qwen3-0.6B smoke test
│ ├── SKILL.md
│ ├── scripts/
│ │ ├── run_offline_inference.py # Phase A: offline inference test
│ │ └── test_serve_mode.py # Phase B: serve + health + chat test
│ └── references/
│ └── error-classification.md # Layer-based error classification
../model-verify/ # Step 4: Target model ± multi-chip
│ ├── SKILL.md
│ ├── scripts/
│ │ └── diff_analysis.py # Compare Run A vs Run B results
│ └── references/
│ └── multichip-errors.md # Multi-chip error patterns
../perf-test/ # Steps 5+6: Accuracy + Performance
│ ├── SKILL.md
│ ├── scripts/
│ │ ├── run_benchmark.py # Run single benchmark profile
│ │ └── run_all_benchmarks.py # Run all profiles + summarize
│ └── references/
│ └── benchmark-profiles.md # Profile definitions and metrics
Pipeline Overview
[Prerequisite: /gpu-container-setup already done by another team]
│
▼
install-stack → Install 5 packages (vLLM, FlagTree, FlagGems, FlagCX, plugin)
│ scripts: detect_network, collect_env_info, select_flagtree_wheel
│
│ GATE: vLLM + plugin must succeed
▼
env-verify → Smoke test with Qwen3-0.6B (FlagGems/CX OFF)
│ scripts: run_offline_inference, test_serve_mode
│
│ Verify Layers 0-3
▼
model-verify → Target model test (OFF then ON), diff analysis
│ scripts: run_offline_inference, test_serve_mode, diff_analysis
│
│ Determine which stack works (full vs base)
▼
perf-test → Accuracy (placeholder) + Performance benchmarks
│ scripts: run_benchmark, run_all_benchmarks
▼
Final Report
Prerequisites
A running Docker container with:
- PyTorch installed and GPU-accessible
- Container name known (e.g. flagrelease-worker)
This container is produced by /gpu-container-setup (maintained by another team).
Execution Flow
Read references/pipeline-state.md for the full state schema and gate logic.
Step 0: Gather Initial Context
Ask user for container name (or detect running containers):
docker ps --format '{{.Names}}' | head -10
Verify the container is running:
docker inspect --format='{{.State.Status}}' <CONTAINER> | grep -q running
Initialize pipeline state (see references/pipeline-state.md).
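A minimal sketch of this step, assuming the pipeline state is kept as a local JSON file and using the example container name from Prerequisites; the real state schema is defined in references/pipeline-state.md and may differ:

```python
# Sketch only: verify the container is running, then write an initial state file.
# The state layout and file name below are illustrative assumptions.
import json
import subprocess

def container_is_running(name: str) -> bool:
    """True if `docker inspect` reports the container status as 'running'."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Status}}", name],
        capture_output=True, text=True, timeout=30,
    )
    return result.returncode == 0 and result.stdout.strip() == "running"

def init_pipeline_state(container: str, path: str = "pipeline_state.json") -> dict:
    """Create an empty per-step record (hypothetical layout)."""
    state = {
        "pipeline": "flagrelease",
        "container": container,
        "steps": {"install_stack": None, "env_verify": None,
                  "model_verify": None, "perf_test": None},
        "errors": [],
    }
    with open(path, "w") as fh:
        json.dump(state, fh, indent=2)
    return state

if __name__ == "__main__":
    name = "flagrelease-worker"  # example name from Prerequisites
    if not container_is_running(name):
        raise SystemExit(f"Container '{name}' is not running; rerun /gpu-container-setup")
    init_pipeline_state(name)
```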
Step 1: Install Software Stack
Read and follow ../install-stack/SKILL.md.
The install-stack skill will:
- Copy scripts/collect_env_info.py into container → get vendor, Python, glibc
- Copy scripts/detect_network.py into container → get mirror config
- Install 5 packages in order, using scripts/select_flagtree_wheel.py for FlagTree
- Run scripts/validate_packages.py inside container → get final status
Gate check: If gate_passed is false (vLLM or plugin failed) → STOP pipeline.
Report FAIL with install errors.
Store result in pipeline state.
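The copy-and-run pattern that install-stack (and later steps) rely on, plus the gate check, can be sketched as below. The helper names and the per-package fallback layout are assumptions; only the gate_passed flag comes from the gate description above, and script CLI arguments are omitted.

```python
# Sketch: copy a script into the container, execute it, parse its JSON stdout,
# then apply the install gate. Helper names are hypothetical.
import json
import os
import subprocess

def run_in_container(container: str, local_script: str, timeout: int = 600) -> dict:
    """docker cp the script in, docker exec it, and parse the JSON it prints."""
    remote = f"/tmp/{os.path.basename(local_script)}"
    subprocess.run(["docker", "cp", local_script, f"{container}:{remote}"],
                   check=True, timeout=60)
    result = subprocess.run(["docker", "exec", container, "python", remote],
                            capture_output=True, text=True, timeout=timeout)
    return json.loads(result.stdout)

def install_gate_passed(validation: dict) -> bool:
    """Gate: vLLM and the vendor plugin must both validate."""
    if "gate_passed" in validation:        # flag described for validate_packages.py
        return bool(validation["gate_passed"])
    pkgs = validation.get("packages", {})  # fallback layout is an assumption
    return bool(pkgs.get("vllm", {}).get("ok")) and bool(pkgs.get("plugin", {}).get("ok"))
```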
Step 2: Environment Verification
Read and follow ../env-verify/SKILL.md.
The env-verify skill will:
- Download Qwen3-0.6B (if not cached)
- Copy scripts/run_offline_inference.py into container → Phase A
- Copy scripts/test_serve_mode.py into container → Phase B
- Classify errors using references/error-classification.md
Decision: Fatal error → STOP. Non-fatal → record and continue.
Store result in pipeline state.
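A sketch of the stop/continue decision, assuming each classified error record carries a boolean fatal flag; the authoritative classification rules live in ../env-verify/references/error-classification.md:

```python
# Sketch: stop only on fatal errors; non-fatal errors are recorded and the
# pipeline continues. The "errors" / "fatal" field names are assumptions.
def should_stop_pipeline(env_verify_result: dict) -> bool:
    return any(err.get("fatal", False) for err in env_verify_result.get("errors", []))
```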
Step 3: Model Verification
Read and follow ../model-verify/SKILL.md.
This step is interactive — it will ask the user for the model path.
The model-verify skill will:
- Get model info from user (AskUserQuestion)
- Reuse run_offline_inference.py and test_serve_mode.py for Run A and Run B
- Run scripts/diff_analysis.py to compare results
- Determine recommended_stack (full/base/none)
Decision: If recommended_stack is none (Run A failed) → STOP.
Store result in pipeline state (including model_path, tp_size, recommended_stack).
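The full/base/none decision follows directly from the two runs; a sketch is below (the boolean inputs are illustrative, diff_analysis.py defines the real output):

```python
# Sketch: Run A = target model with FlagGems/FlagCX OFF, Run B = ON.
def recommend_stack(run_a_ok: bool, run_b_ok: bool) -> str:
    if not run_a_ok:
        return "none"                      # gate: pipeline stops here
    return "full" if run_b_ok else "base"  # base = plain vLLM + plugin stack
```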
Step 4: Performance Test
Read and follow ../perf-test/SKILL.md.
The perf-test skill will:
- Start vllm serve with recommended stack
- Copy scripts/run_all_benchmarks.py into container → run 5 profiles
- Collect metrics and produce summary table
Store result in pipeline state.
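A sketch of the serve-then-benchmark flow. The USE_FLAGGEMS / USE_FLAGCX environment names are placeholders for whatever actually toggles the full stack, and the health check assumes the serve port is reachable from the host on 8000:

```python
# Sketch: start `vllm serve` detached inside the container, wait for /health,
# then hand off to run_all_benchmarks.py (copied in as in earlier steps).
import subprocess
import time
import urllib.request

def start_serve(container: str, model_path: str, tp_size: int, stack: str) -> None:
    env = ["-e", "USE_FLAGGEMS=1", "-e", "USE_FLAGCX=1"] if stack == "full" else []  # placeholder toggles
    subprocess.run(["docker", "exec", "-d", *env, container, "vllm", "serve",
                    model_path, "--tensor-parallel-size", str(tp_size)],
                   check=True, timeout=60)

def wait_for_health(url: str = "http://localhost:8000/health", timeout: int = 600) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(5)
    return False
```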
Step 5: Final Report
Compile all results from pipeline state into a final report:
{
"status": "PASS | PARTIAL | FAIL",
"pipeline": "flagrelease",
"container": "<name>",
"vendor": "<vendor>",
"model": "<path>",
"tensor_parallel_size": 8,
"steps": {
"install_stack": { "status": "...", "packages": {...} },
"env_verify": { "status": "...", "phase_a": "...", "phase_b": "..." },
"model_verify": { "status": "...", "run_a": "...", "run_b": "...", "recommended_stack": "..." },
"perf_test": { "status": "...", "profiles_passed": "5/5", "summary_table": "..." }
},
"errors": [...],
"conclusion": "Pipeline completed. ..."
}
Present to user with clear summary:
- Which packages installed / failed
- Whether base stack works
- Whether multi-chip stack works (and which component failed if not)
- Performance numbers (summary table)
- All errors with layer classification
Overall status:
- PASS — all steps pass, full multi-chip stack works
- PARTIAL — model works with degraded stack, or some perf profiles failed
- FAIL — model cannot serve (gate or Run A failure)
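A sketch of deriving the overall status from the per-step records, assuming lowercase pass/fail step statuses and the field names used in the report template above:

```python
# Sketch: PASS / PARTIAL / FAIL per the rules above; field names follow the template.
def overall_status(steps: dict) -> str:
    install = steps.get("install_stack") or {}
    model = steps.get("model_verify") or {}
    if install.get("status") == "fail" or model.get("recommended_stack") == "none":
        return "FAIL"
    all_pass = all((steps.get(k) or {}).get("status") == "pass"
                   for k in ("install_stack", "env_verify", "model_verify", "perf_test"))
    if all_pass and model.get("recommended_stack") == "full":
        return "PASS"
    return "PARTIAL"
```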
Design Rules
- Every operation has a timeout — no hangs allowed
- Every error is caught with precise location (step, phase, layer, cause)
- Pipeline always completes with success or structured error report
- One sub-step failure does NOT skip unrelated steps (unless gate failure)
- Network uses mirrors when direct access fails
- Scripts produce JSON — structured, parseable, comparable across runs
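A sketch of how the timeout and structured-error rules combine when executing a step's script in the container. The error record fields mirror the (step, phase, layer, cause) convention above; everything else here is an assumption:

```python
# Sketch: every docker exec gets a hard timeout, and every failure becomes a
# structured record instead of an unhandled exception.
import json
import subprocess

def run_step_script(container: str, remote_path: str, step: str, phase: str,
                    timeout: int = 900) -> dict:
    try:
        result = subprocess.run(
            ["docker", "exec", container, "python", remote_path],
            capture_output=True, text=True, timeout=timeout,
        )
        return json.loads(result.stdout)
    except subprocess.TimeoutExpired:
        cause = f"timeout after {timeout}s"
        return {"status": "fail",
                "error": {"step": step, "phase": phase, "layer": None, "cause": cause}}
    except (json.JSONDecodeError, OSError) as exc:
        return {"status": "fail",
                "error": {"step": step, "phase": phase, "layer": None, "cause": str(exc)}}
```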