flagrelease

Installation
SKILL.md

FlagRelease Pipeline Orchestrator

End-to-end LLM deployment + testing pipeline for multi-chip GPU backends. Orchestrates 4 sub-skills in sequence and produces a final report.

Skill Components

flagrelease/
├── SKILL.md                            # This file — orchestration flow
└── references/
    └── pipeline-state.md               # Pipeline state schema, gate logic, data flow

Sub-skills (each independently invokable):

../install-stack/                       # Step 2: Install 5 packages
│   ├── SKILL.md
│   ├── scripts/
│   │   ├── detect_network.py           # Probe GitHub/PyPI, return mirror config
│   │   ├── collect_env_info.py         # Python/glibc/arch/vendor/disk info
│   │   ├── select_flagtree_wheel.py    # Match vendor+python+glibc → wheel
│   │   └── validate_packages.py        # Import-test all 5 packages
│   └── references/
│       ├── vendor-mappings.md          # FlagCX make flags, adaptor names
│       └── network-mirrors.md          # Mirror config rules

../env-verify/                          # Step 3: Qwen3-0.6B smoke test
│   ├── SKILL.md
│   ├── scripts/
│   │   ├── run_offline_inference.py    # Phase A: offline inference test
│   │   └── test_serve_mode.py          # Phase B: serve + health + chat test
│   └── references/
│       └── error-classification.md     # Layer-based error classification

../model-verify/                        # Step 4: Target model ± multi-chip
│   ├── SKILL.md
│   ├── scripts/
│   │   └── diff_analysis.py            # Compare Run A vs Run B results
│   └── references/
│       └── multichip-errors.md         # Multi-chip error patterns

../perf-test/                           # Steps 5+6: Accuracy + Performance
│   ├── SKILL.md
│   ├── scripts/
│   │   ├── run_benchmark.py            # Run single benchmark profile
│   │   └── run_all_benchmarks.py       # Run all profiles + summarize
│   └── references/
│       └── benchmark-profiles.md       # Profile definitions and metrics

Pipeline Overview

[Prerequisite: /gpu-container-setup already done by another team]
  install-stack   →  Install 5 packages (vLLM, FlagTree, FlagGems, FlagCX, plugin)
       │                scripts: detect_network, collect_env_info, select_flagtree_wheel
       │  GATE: vLLM + plugin must succeed
  env-verify      →  Smoke test with Qwen3-0.6B (FlagGems/CX OFF)
       │                scripts: run_offline_inference, test_serve_mode
       │  Verify Layers 0-3
  model-verify    →  Target model test (OFF then ON), diff analysis
       │                scripts: run_offline_inference, test_serve_mode, diff_analysis
       │  Determine which stack works (full vs base)
  perf-test       →  Accuracy (placeholder) + Performance benchmarks
       │                scripts: run_benchmark, run_all_benchmarks
  Final Report

Prerequisites

A running Docker container with:

  • PyTorch installed and GPU-accessible
  • Container name known (e.g. flagrelease-worker)

This container is produced by /gpu-container-setup (maintained by another team).

Execution Flow

Read references/pipeline-state.md for the full state schema and gate logic.

Step 0: Gather Initial Context

Ask user for container name (or detect running containers):

docker ps --format '{{.Names}}' | head -10

Verify the container is running:

docker inspect --format='{{.State.Status}}' <CONTAINER> | grep -q running

Initialize pipeline state (see references/pipeline-state.md).

Step 1: Install Software Stack

Read and follow ../install-stack/SKILL.md.

The install-stack skill will:

  1. Copy scripts/collect_env_info.py into container → get vendor, Python, glibc
  2. Copy scripts/detect_network.py into container → get mirror config
  3. Install 5 packages in order, using scripts/select_flagtree_wheel.py for FlagTree
  4. Run scripts/validate_packages.py inside container → get final status

Gate check: If gate_passed is false (vLLM or plugin failed) → STOP pipeline. Report FAIL with install errors.

Store result in pipeline state.

Step 2: Environment Verification

Read and follow ../env-verify/SKILL.md.

The env-verify skill will:

  1. Download Qwen3-0.6B (if not cached)
  2. Copy scripts/run_offline_inference.py into container → Phase A
  3. Copy scripts/test_serve_mode.py into container → Phase B
  4. Classify errors using references/error-classification.md

Decision: Fatal error → STOP. Non-fatal → record and continue.

Store result in pipeline state.

Step 3: Model Verification

Read and follow ../model-verify/SKILL.md.

This step is interactive — will ask user for model path.

The model-verify skill will:

  1. Get model info from user (AskUserQuestion)
  2. Reuse run_offline_inference.py and test_serve_mode.py for Run A and Run B
  3. Run scripts/diff_analysis.py to compare results
  4. Determine recommended_stack (full/base/none)

Decision: If recommended_stack is none (Run A failed) → STOP.

Store result in pipeline state (including model_path, tp_size, recommended_stack).

Step 4: Performance Test

Read and follow ../perf-test/SKILL.md.

The perf-test skill will:

  1. Start vllm serve with recommended stack
  2. Copy scripts/run_all_benchmarks.py into container → run 5 profiles
  3. Collect metrics and produce summary table

Store result in pipeline state.

Step 5: Final Report

Compile all results from pipeline state into a final report:

{
  "status": "PASS | PARTIAL | FAIL",
  "pipeline": "flagrelease",
  "container": "<name>",
  "vendor": "<vendor>",
  "model": "<path>",
  "tensor_parallel_size": 8,
  "steps": {
    "install_stack": { "status": "...", "packages": {...} },
    "env_verify":    { "status": "...", "phase_a": "...", "phase_b": "..." },
    "model_verify":  { "status": "...", "run_a": "...", "run_b": "...", "recommended_stack": "..." },
    "perf_test":     { "status": "...", "profiles_passed": "5/5", "summary_table": "..." }
  },
  "errors": [...],
  "conclusion": "Pipeline completed. ..."
}

Present to user with clear summary:

  1. Which packages installed / failed
  2. Whether base stack works
  3. Whether multi-chip stack works (and which component failed if not)
  4. Performance numbers (summary table)
  5. All errors with layer classification

Overall status:

  • PASS — all steps pass, full multi-chip stack works
  • PARTIAL — model works with degraded stack, or some perf profiles failed
  • FAIL — model cannot serve (gate or Run A failure)

Design Rules

  • Every operation has a timeout — no hangs allowed
  • Every error is caught with precise location (step, phase, layer, cause)
  • Pipeline always completes with success or structured error report
  • One sub-step failure does NOT skip unrelated steps (unless gate failure)
  • Network uses mirrors when direct access fails
  • Scripts produce JSON — structured, parseable, comparable across runs
Related skills
Installs
1
GitHub Stars
10
First Seen
Mar 25, 2026