vllm-plugin-fl-setup-flagos
vLLM-Plugin-FL Setup
Overview
vLLM-Plugin-FL extends vLLM to support model inference/serving across diverse hardware backends (NVIDIA, Ascend, MetaX, Iluvatar, etc.) via FlagOS's unified operator library FlagGems and communication library FlagCX. This skill covers installation, hardware-specific environment configuration, and dependency setup.
Prerequisites
- Linux OS (Ubuntu 20.04+ recommended)
- Python 3.10+
- vLLM v0.13.0 — install from the official v0.13.0 release or the fork vllm-FL
- GPU with appropriate drivers (NVIDIA CUDA, Huawei Ascend, etc.)
- pip package manager
- Git
Verify vLLM version before proceeding:
python -c "import vllm; print(vllm.__version__)"
# Expected output: 0.13.0
Installation Workflow
Step 1: Identify Hardware Backend
# NVIDIA GPU
nvidia-smi
# Huawei NPU
npu-smi info
# Moore Threads GPU
mthreads-gmi
# Iluvatar GPU
ixsmi
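If you are not sure which backend is present, the sketch below probes PATH for the vendor tools listed above using only the Python standard library. The tool-to-vendor mapping mirrors the commands in this step; treat it as a convenience helper, not an exhaustive detector.
# Sketch: probe PATH for vendor management tools to guess the backend.
import shutil

TOOLS = {
    "nvidia-smi": "NVIDIA GPU",
    "npu-smi": "Huawei Ascend NPU",
    "mthreads-gmi": "Moore Threads GPU",
    "ixsmi": "Iluvatar GPU",
}

found = {tool: vendor for tool, vendor in TOOLS.items() if shutil.which(tool)}
if found:
    for tool, vendor in found.items():
        print(f"Detected {vendor} (via {tool})")
else:
    print("No known vendor tool found on PATH; check your driver installation.")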
Step 2: Install vLLM-Plugin-FL
First create a workspace directory and try cloning the source code:
mkdir -p ~/flagos-workspace && cd ~/flagos-workspace
git clone https://github.com/flagos-ai/vllm-plugin-FL
If git clone fails due to network issues, ask the user for their network proxy settings (e.g. http_proxy / https_proxy), configure the proxy, then retry the clone.
Then install from the source directory:
cd vllm-plugin-FL
pip install -r requirements.txt
pip install --no-build-isolation .
# Required to enable vLLM-Plugin-FL when running vLLM
export VLLM_PLUGINS='fl'
Verify vLLM-Plugin-FL installation:
python -c "import vllm_fl; print('vllm-plugin-FL installed successfully')"
Step 3: Install FlagGems
Ascend NPU users: Before installing FlagGems, you must first install FlagTree. See references/npu.md and complete the FlagTree installation step there before proceeding. Otherwise the FlagGems verification will fail repeatedly and keep reinstalling Triton.
# Install build dependencies
pip install -U scikit-build-core==0.11 pybind11 ninja cmake
# Clone FlagGems source code
cd ~/flagos-workspace
git clone https://github.com/flagos-ai/FlagGems
If git clone fails due to network issues, ask the user for their network proxy settings (e.g. http_proxy / https_proxy), configure the proxy, then retry the clone.
Then install from the source directory:
cd FlagGems
pip install --no-build-isolation .
Verify FlagGems installation:
python -c "import flag_gems; print('FlagGems installed successfully')"
Step 4: (Optional) Install FlagCX
FlagCX is a unified communication library for multi-device distributed inference, supporting both homogeneous and heterogeneous setups. Skip this step if running on a single device.
Note: Ascend NPU does not need FlagCX — skip this step for Ascend backends.
cd ~/flagos-workspace
git clone https://github.com/flagos-ai/FlagCX.git
If git clone fails due to network issues, ask the user for their network proxy settings (e.g. http_proxy / https_proxy), configure the proxy, then retry the clone.
Then build and install from the source directory:
cd FlagCX
git submodule update --init --recursive
# Build for your platform (e.g. USE_NVIDIA=1 for NVIDIA)
make USE_NVIDIA=1
export FLAGCX_PATH="$PWD"
# Install Python binding (replace [xxx] with your platform: nvidia, ascend, etc.)
cd plugin/torch/
FLAGCX_ADAPTOR=[xxx] pip install --no-build-isolation .
Verify FlagCX installation:
python -c "import flagcx; print('FlagCX installed successfully')"
Step 5: Backend-Specific Setup
Some hardware backends require additional setup. See the corresponding reference document:
| Backend | Chip Vendor | Reference |
|---|---|---|
| Ascend NPU | Huawei | references/npu.md |
| MetaX GPU | MetaX | TBD |
| Iluvatar GPU (BI-V150) | Iluvatar | references/iluvatar_gpu.md |
| Pingtouge-Zhenwu | Pingtouge | TBD |
| Tsingmicro | Tsingmicro | TBD |
| Moore Threads GPU | Moore Threads | references/mthreads_gpu.md |
| Hygon DCU | Hygon | TBD |
Quick Test
- Ask the user for the model name they want to test (e.g. Qwen3-4B, DeepSeek-R1).
- Search the machine for a local copy of that model:
find / -maxdepth 5 -type d -name "<user_provided_model_name>" 2>/dev/null
- If found, use the discovered path. If not found, tell the user and ask them to provide a different model name or a full local path, then repeat the search. If after 3 attempts no valid model is found, skip the quick test and inform the user to prepare a model before retrying.
- Ensure the FL plugin is enabled before running inference:
export VLLM_PLUGINS='fl'
For Moore Threads GPU, also set:
export USE_FLAGGEMS=1
export FLAGCX_PATH=/workspace/FlagCX # MUST point to the actual FlagCX installation directory; this is only an example
export VLLM_MUSA_ENABLE_MOE_TRITON=1
- Once a valid model path is resolved, run offline batched inference to verify the full stack:
from vllm import LLM, SamplingParams
model_path = "<resolved_model_path>"
prompts = [
"Hello, my name is",
]
sampling_params = SamplingParams(max_tokens=10, temperature=0.0)
# For Moore Threads GPU, add: enforce_eager=True, block_size=64, attention_config={"backend": "TORCH_SDPA"}
# For Iluvatar BI-V150, add: enforce_eager=True
llm = LLM(model=model_path, max_num_batched_tokens=16384, max_num_seqs=2048)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Troubleshooting
Out of memory on model load: Use the gpu_memory_utilization parameter to cap GPU memory usage. Start with 0.8 and adjust:
from vllm import LLM
llm = LLM(model="...", gpu_memory_utilization=0.8)
FlagGems build failures: Ensure build dependencies are installed (scikit-build-core, pybind11, ninja, cmake). Check that your compiler supports C++17.
Plugin not loaded: If vLLM does not use the FL plugin, verify that VLLM_PLUGINS='fl' is set in your environment.
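A quick way to check both conditions from one Python process (vllm_fl is the module name used in the verification step above):
# Sketch: confirm the env var is set and the plugin package is importable.
import importlib.util
import os

print("VLLM_PLUGINS =", os.environ.get("VLLM_PLUGINS"))  # expect 'fl'
print("vllm_fl importable:", importlib.util.find_spec("vllm_fl") is not None)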
FlagCX communication errors: Ensure FLAGCX_PATH is correctly set and the library was built for your platform. For NVIDIA, verify with make USE_NVIDIA=1.
Ascend-specific issues: See references/npu.md for Ascend NPU troubleshooting, including FlagTree setup and eager execution requirements.
Cannot connect to GitHub: Ask the user for their network proxy settings (e.g. http_proxy / https_proxy), configure the proxy, then retry the git clone command.
References
- vLLM-Plugin-FL GitHub
- FlagGems GitHub
- FlagCX GitHub
- For non-NVIDIA chips, refer to the references directory for hardware-specific configurations and setup instructions