vllm-plugin-fl-setup-flagos
vLLM-Plugin-FL Setup
Overview
vLLM-Plugin-FL extends vLLM to support model inference/serving across diverse hardware backends (NVIDIA, Ascend, MetaX, Iluvatar, etc.) via FlagOS's unified operator library FlagGems and communication library FlagCX. This skill covers installation, hardware-specific environment configuration, and dependency setup.
Prerequisites
- Linux OS (Ubuntu 20.04+ recommended)
- Python 3.10+
- vLLM v0.13.0 — install from the official v0.13.0 release or the fork vllm-FL
- GPU with appropriate drivers (NVIDIA CUDA, Huawei Ascend, etc.)
- pip package manager
- Git
Verify vLLM version before proceeding:
python -c "import vllm; print(vllm.__version__)"
# Expected output: 0.13.0
Installation Workflow
Step 1: Identify Hardware Backend
# NVIDIA GPU
nvidia-smi
# Huawei NPU
npu-smi info
# Moore Threads GPU
mthreads-gmi
# Iluvatar GPU
ixsmi
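If you are not sure which backend is present, the sketch below probes PATH for the vendor tools listed above using only the Python standard library. The tool-to-vendor mapping mirrors the commands in this step; treat it as a convenience helper, not an exhaustive detector.
# Sketch: probe PATH for vendor management tools to guess the backend.
import shutil

TOOLS = {
    "nvidia-smi": "NVIDIA GPU",
    "npu-smi": "Huawei Ascend NPU",
    "mthreads-gmi": "Moore Threads GPU",
    "ixsmi": "Iluvatar GPU",
}

found = {tool: vendor for tool, vendor in TOOLS.items() if shutil.which(tool)}
if found:
    for tool, vendor in found.items():
        print(f"Detected {vendor} (via {tool})")
else:
    print("No known vendor tool found on PATH; check your driver installation.")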
Step 2: Install vLLM-Plugin-FL
First create a workspace directory and try cloning the source code:
mkdir -p ~/flagos-workspace && cd ~/flagos-workspace
git clone https://github.com/flagos-ai/vllm-plugin-FL
If git clone fails due to network issues, ask the user for their network proxy settings (e.g. http_proxy / https_proxy), configure the proxy, then retry the clone.
Then install from the source directory:
cd vllm-plugin-FL
pip install -r requirements.txt
pip install --no-build-isolation .
# Required to enable vLLM-Plugin-FL when running vLLM
export VLLM_PLUGINS='fl'
Verify vLLM-Plugin-FL installation:
python -c "import vllm_fl; print('vllm-plugin-FL installed successfully')"
Step 3: Install FlagGems
Ascend NPU users: Before installing FlagGems, you must first install FlagTree. See references/npu.md and complete the FlagTree installation step there before proceeding. Otherwise the FlagGems verification will fail repeatedly and keep reinstalling Triton.
# Install build dependencies
pip install -U scikit-build-core==0.11 pybind11 ninja cmake
# Clone FlagGems source code
cd ~/flagos-workspace
git clone https://github.com/flagos-ai/FlagGems
If git clone fails due to network issues, ask the user for their network proxy settings (e.g. http_proxy / https_proxy), configure the proxy, then retry the clone.
Then install from the source directory:
cd FlagGems
pip install --no-build-isolation .
Verify FlagGems installation:
python -c "import flag_gems; print('FlagGems installed successfully')"
Step 4: (Optional) Install FlagCX
FlagCX is a unified communication library for multi-device distributed inference, supporting both homogeneous and heterogeneous setups. Skip this step if running on a single device.
Note: Ascend NPU does not need FlagCX — skip this step for Ascend backends.
cd ~/flagos-workspace
git clone https://github.com/flagos-ai/FlagCX.git
If git clone fails due to network issues, ask the user for their network proxy settings (e.g. http_proxy / https_proxy), configure the proxy, then retry the clone.
Then build and install from the source directory:
cd FlagCX
git submodule update --init --recursive
# Build for your platform (e.g. USE_NVIDIA=1 for NVIDIA)
make USE_NVIDIA=1
export FLAGCX_PATH="$PWD"
# Install Python binding (replace [xxx] with your platform: nvidia, ascend, etc.)
cd plugin/torch/
FLAGCX_ADAPTOR=[xxx] pip install --no-build-isolation .
Verify FlagCX installation:
python -c "import flagcx; print('FlagCX installed successfully')"
Step 5: Backend-Specific Setup
Some hardware backends require additional setup. See the corresponding reference document:
| Backend | Chip Vendor | Reference |
|---|---|---|
| Ascend NPU | Huawei | references/npu.md |
| MetaX GPU | MetaX | TBD |
| Iluvatar GPU (BI-V150) | Iluvatar | references/iluvatar_gpu.md |
| Pingtouge-Zhenwu | Pingtouge | TBD |
| Tsingmicro | Tsingmicro | TBD |
| Moore Threads GPU | Moore Threads | references/mthreads_gpu.md |
| Hygon DCU | Hygon | TBD |
Quick Test
- Ask the user for the model name they want to test (e.g. Qwen3-4B, DeepSeek-R1).
- Search the machine for a local copy of that model:
find / -maxdepth 5 -type d -name "<user_provided_model_name>" 2>/dev/null
- If found, use the discovered path. If not found, tell the user and ask them to provide a different model name or a full local path, then repeat the search. If after 3 attempts no valid model is found, skip the quick test and inform the user to prepare a model before retrying.
- Ensure the FL plugin is enabled before running inference:
export VLLM_PLUGINS='fl'
For Moore Threads GPU, also set:
export USE_FLAGGEMS=1
export FLAGCX_PATH=/workspace/FlagCX # MUST point to the actual FlagCX installation directory; this is only an example
export VLLM_MUSA_ENABLE_MOE_TRITON=1
- Once a valid model path is resolved, run offline batched inference to verify the full stack:
from vllm import LLM, SamplingParams
model_path = "<resolved_model_path>"
prompts = [
"Hello, my name is",
]
sampling_params = SamplingParams(max_tokens=10, temperature=0.0)
# For Moore Threads GPU, add: enforce_eager=True, block_size=64, attention_config={"backend": "TORCH_SDPA"}
# For Iluvatar BI-V150, add: enforce_eager=True
llm = LLM(model=model_path, max_num_batched_tokens=16384, max_num_seqs=2048)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Troubleshooting
Out of memory on model load: Use the gpu_memory_utilization parameter to cap GPU memory usage. Start with 0.8 and adjust:
from vllm import LLM
llm = LLM(model="...", gpu_memory_utilization=0.8)
FlagGems build failures: Ensure build dependencies are installed (scikit-build-core, pybind11, ninja, cmake). Check that your compiler supports C++17.
Plugin not loaded: If vLLM does not use the FL plugin, verify that VLLM_PLUGINS='fl' is set in your environment.
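A quick way to check both conditions from one Python process (vllm_fl is the module name used in the verification step above):
# Sketch: confirm the env var is set and the plugin package is importable.
import importlib.util
import os

print("VLLM_PLUGINS =", os.environ.get("VLLM_PLUGINS"))  # expect 'fl'
print("vllm_fl importable:", importlib.util.find_spec("vllm_fl") is not None)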
FlagCX communication errors: Ensure FLAGCX_PATH is correctly set and the library was built for your platform. For NVIDIA, verify with make USE_NVIDIA=1.
Ascend-specific issues: See references/npu.md for Ascend NPU troubleshooting, including FlagTree setup and eager execution requirements.
Cannot connect to GitHub: Ask the user for their network proxy settings (e.g. http_proxy / https_proxy), configure the proxy, then retry the git clone command.
References
- vLLM-Plugin-FL GitHub
- FlagGems GitHub
- FlagCX GitHub
- For non-NVIDIA chips, refer to the references directory for hardware-specific configurations and setup instructions