vLLM-Ascend - LLM Inference Serving
vLLM-Ascend is a plugin for vLLM that enables efficient LLM inference on Huawei Ascend AI processors. It provides Ascend-optimized kernels, quantization support, and distributed inference capabilities.
Quick Start
Offline Batch Inference
import os
# Required for vLLM-Ascend: set multiprocessing method before importing vLLM
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
from vllm import LLM, SamplingParams
# Load model with Ascend NPU (device auto-detected when vllm-ascend is installed)
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096
)
# Prepare prompts and sampling params
prompts = [
    "Hello, how are you?",
    "Explain quantum computing in simple terms.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
# Generate outputs
outputs = llm.generate(prompts, sampling_params)
# Print results
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}\n")
OpenAI-Compatible API Server
# Start the API server
vllm serve Qwen/Qwen2.5-7B-Instruct \
--max-model-len 4096 \
--max-num-seqs 256 \
--served-model-name "qwen2.5-7b"
# Or using Python
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--max-model-len 4096
API Client Example
import requests
# Completions API
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "qwen2.5-7b",
        "prompt": "Once upon a time",
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
# Chat Completions API
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen2.5-7b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 100
    }
)
print(response.json())
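Because the server exposes the OpenAI protocol, the official openai Python client can be used instead of raw HTTP calls. A minimal sketch, assuming the openai package is installed and the server above was started with --served-model-name "qwen2.5-7b":
from openai import OpenAI
# Point the client at the local vLLM server; the key is only checked if the
# server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="qwen2.5-7b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)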
Installation
Prerequisites
- CANN: 8.0.RC1 or higher
- Python: 3.9 or higher
- PyTorch Ascend: Compatible with your CANN version
Method 1: Docker (Recommended)
# Pull pre-built image
docker pull ascendai/vllm-ascend:latest
# Run with NPU access
docker run -it --rm \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons:/usr/local/Ascend/add-ons \
-e ASCEND_RT_VISIBLE_DEVICES=0 \
ascendai/vllm-ascend:latest
Method 2: pip Installation
# Install vLLM with Ascend plugin
pip install vllm-ascend
# Or install from source
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
Verify Installation
# Check vLLM Ascend installation
python -c "import vllm_ascend; print(vllm_ascend.__version__)"
# Check NPU availability
python -c "import torch; import torch_npu; print(torch_npu.npu.device_count())"
Deployment
Server Mode
# Basic server deployment
vllm serve <model_path> \
--served-model-name <name> \
--host 0.0.0.0 \
--port 8000
# Production deployment with optimizations
vllm serve /path/to/model \
--served-model-name "qwen2.5-72b" \
--max-model-len 8192 \
--max-num-seqs 256 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--dtype bfloat16 \
--api-key <your-api-key>
Python API
import os
# Required: Set spawn method before importing vLLM
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
from vllm import LLM, SamplingParams
# Single NPU
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    dtype="bfloat16"
)
# Distributed inference (multi-NPU)
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192
)
# Generate
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Hello world"], params)
LLM Engine (Advanced)
from vllm import LLMEngine, EngineArgs, SamplingParams
engine_args = EngineArgs(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096
)
engine = LLMEngine.from_engine_args(engine_args)
# Add requests and step through generation
request_id = "req-001"
prompt = "Hello, world!"
params = SamplingParams(max_tokens=50)
engine.add_request(request_id, prompt, params)
while engine.has_unfinished_requests():
    outputs = engine.step()
    for output in outputs:
        if output.finished:
            print(f"{output.request_id}: {output.outputs[0].text}")
Quantization
vLLM-Ascend supports models quantized with msModelSlim. For quantization details, see msmodelslim.
Using Quantized Models
# W8A8 quantized model
vllm serve /path/to/quantized-model-w8a8 \
--quantization ascend \
--max-model-len 4096
# W4A8 quantized model
vllm serve /path/to/quantized-model-w4a8 \
--quantization ascend \
--max-model-len 4096
Python API with Quantization
from vllm import LLM, SamplingParams
llm = LLM(
    model="/path/to/quantized-model",
    quantization="ascend",
    max_model_len=4096
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Hello"], params)
Distributed Inference
Tensor Parallelism
Shards each layer's weights across multiple NPUs so that models too large for a single device can be served.
# 4-way tensor parallelism
vllm serve Qwen/Qwen2.5-72B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192
import os
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
from vllm import LLM
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192
)
Pipeline Parallelism
Splits the model's layers into sequential stages across NPUs or nodes; typically combined with tensor parallelism for very large models.
from vllm import LLM
llm = LLM(
    model="DeepSeek-V3",
    pipeline_parallel_size=2,
    tensor_parallel_size=4
)
Multi-Node Deployment
# Node 0 (Rank 0)
vllm serve <model> \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--distributed-init-method "tcp://192.168.1.10:29500" \
--distributed-rank 0
# Node 1 (Rank 1)
vllm serve <model> \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--distributed-init-method "tcp://192.168.1.10:29500" \
--distributed-rank 1
Performance Optimization
Key Parameters
| Parameter | Default | Description | Tuning Advice |
|---|---|---|---|
| --max-model-len | Model max | Maximum sequence length | Reduce if OOM |
| --max-num-seqs | 256 | Max concurrent sequences | Increase for throughput |
| --gpu-memory-utilization | 0.9 | GPU memory fraction | Lower if OOM during warmup |
| --dtype | auto | Data type | bfloat16 for speed, float16 for compatibility |
| --tensor-parallel-size | 1 | Tensor parallelism degree | Use for large models |
| --pipeline-parallel-size | 1 | Pipeline parallelism degree | Use for very large models |
Example Configurations
# Small model (7B), single NPU
vllm serve <model> --max-model-len 4096 --max-num-seqs 256
# Medium model (32B), single NPU
vllm serve <model> --max-model-len 8192 --max-num-seqs 128
# Large model (72B), multi-NPU
vllm serve <model> --tensor-parallel-size 4 --max-model-len 8192
# Maximum throughput
vllm serve <model> --max-num-seqs 512 --gpu-memory-utilization 0.95
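To sanity-check a configuration before exposing the server, a rough throughput number can be measured with the offline Python API. A minimal sketch, assuming a Qwen2.5-7B checkpoint and a placeholder batch of repeated prompts:
import os
import time
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=4096, max_num_seqs=256)
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["Summarize the history of computing."] * 64  # placeholder workload
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
# Count generated tokens only (prompt tokens excluded)
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")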
Troubleshooting
Common Issues
Q: AclNN_Parameter_Error or dtype errors?
# Check CANN version compatibility
npu-smi info
# Ensure CANN >= 8.0.RC1
# Try different dtype
vllm serve <model> --dtype float16
Q: Out of Memory (OOM)?
# Reduce max model length
vllm serve <model> --max-model-len 2048
# Lower memory utilization
vllm serve <model> --gpu-memory-utilization 0.8
# Reduce concurrent sequences
vllm serve <model> --max-num-seqs 128
Q: Model loading fails?
# Check model path
ls /path/to/model
# Verify tokenizer
python -c "from transformers import AutoTokenizer; tok = AutoTokenizer.from_pretrained('/path/to/model'); print('OK')"
# Use trust_remote_code for custom models
vllm serve <model> --trust-remote-code
Q: Slow inference?
# Enable bfloat16 for faster compute
vllm serve <model> --dtype bfloat16
# Adjust block size
vllm serve <model> --block-size 256
# Enable prefix caching
vllm serve <model> --enable-prefix-caching
Q: API server connection refused?
# Check server is running
curl http://localhost:8000/health
# Verify port is not in use
lsof -i :8000
# Use explicit host/port
vllm serve <model> --host 0.0.0.0 --port 8000
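When scripting deployments, it helps to wait for the /health endpoint before sending traffic. A minimal sketch using the requests package; the URL and timeout are placeholders:
import time
import requests

def wait_for_server(url="http://localhost:8000/health", timeout=300):
    # Poll the health endpoint until the server answers or the timeout expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not listening yet; keep polling
        time.sleep(5)
    return False

print("server ready" if wait_for_server() else "server did not come up in time")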
Environment Variables
# Required: Set multiprocessing method for vLLM-Ascend
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Set Ascend device IDs
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Debug logging
export VLLM_LOGGING_LEVEL=DEBUG
# Disable lazy initialization (for debugging)
export VLLM_ASCEND_LAZY_INIT=0
Scripts
- scripts/benchmark_throughput.py - Throughput benchmark
- scripts/benchmark_latency.py - Latency benchmark
- scripts/start_server.sh - Server startup template
References
- references/deployment.md - Deployment patterns and best practices
- references/supported-models.md - Complete model support matrix
- references/api-reference.md - API endpoint documentation
Related Skills
- msmodelslim - Model quantization for vLLM-Ascend
- ascend-docker - Docker container setup for Ascend
- npu-smi - NPU device management
- hccl-test - HCCL performance testing for multi-NPU
Official References
- vLLM-Ascend Documentation: https://docs.vllm.ai/projects/ascend/en/latest/
- vLLM Documentation: https://docs.vllm.ai/
- Huawei Ascend: https://www.hiascend.com/document
- GitHub Repository: https://github.com/vllm-project/vllm-ascend