vLLM-Ascend - LLM Inference Serving
vLLM-Ascend is a plugin for vLLM that enables efficient LLM inference on Huawei Ascend AI processors. It provides Ascend-optimized kernels, quantization support, and distributed inference capabilities.
Quick Start
Offline Batch Inference
import os
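# Required for vLLM-Ascend: set the multiprocessing start method before importing vLLM.
# With "spawn", keep the code below under `if __name__ == "__main__":` when it lives in a script file.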
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
from vllm import LLM, SamplingParams
# Load model with Ascend NPU (device auto-detected when vllm-ascend is installed)
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096
)
# Prepare prompts and sampling params
prompts = [
    "Hello, how are you?",
    "Explain quantum computing in simple terms.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
# Generate outputs
outputs = llm.generate(prompts, sampling_params)
# Print results
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}\n")
OpenAI-Compatible API Server
# Start the API server
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --served-model-name "qwen2.5-7b"
# Or via the Python module entry point. Without --served-model-name,
# clients must request the model as Qwen/Qwen2.5-7B-Instruct.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 4096
API Client Example
import requests
# Completions API
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "qwen2.5-7b",
        "prompt": "Once upon a time",
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
# Chat Completions API
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen2.5-7b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 100
    }
)
print(response.json())
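Both endpoints answer with JSON in the OpenAI response shape, so the generated text sits under `choices`. The sketch below shows one way to pull it out, assuming that shape. The helper name `extract_text` and the sample payloads are illustrative and hand-written; they are not part of vLLM and not real server output.

```python
def extract_text(payload: dict) -> str:
    """Return the first generated text from an OpenAI-style response body."""
    choice = payload["choices"][0]
    # Chat completions nest the text under "message"; plain completions use "text".
    if "message" in choice:
        return choice["message"]["content"]
    return choice["text"]


# Hand-written sample payloads, trimmed to the fields used here.
completion_payload = {"choices": [{"index": 0, "text": " there was a model."}]}
chat_payload = {
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hi!"}}]
}

print(extract_text(completion_payload))
print(extract_text(chat_payload))
```

In the client code above, the same helper would be called as `extract_text(response.json())`.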
Installation
Prerequisites
- CANN: 8.0.RC1 or higher
- Python: 3.9 or higher
- PyTorch Ascend: Compatible with your CANN version
Method 1: Docker (Recommended)
# Pull pre-built image
docker pull ascendai/vllm-ascend:latest
# Run with NPU access
docker run -it --rm \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/add-ons:/usr/local/Ascend/add-ons \
    -e ASCEND_RT_VISIBLE_DEVICES=0 \
    ascendai/vllm-ascend:latest
Method 2: pip Installation
# Install vLLM with Ascend plugin
pip install vllm-ascend
# Or install from source
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
Verify Installation
# Check vLLM Ascend installation
python -c "import vllm_ascend; print(vllm_ascend.__version__)"
# Check NPU availability
python -c "import torch; import torch_npu; print(torch_npu.npu.device_count())"
Deployment
Server Mode
# Basic server deployment
vllm serve <model_path> \
    --served-model-name <name> \
    --host 0.0.0.0 \
    --port 8000
# Production deployment with optimizations
vllm serve /path/to/model \
    --served-model-name "qwen2.5-72b" \
    --max-model-len 8192 \
    --max-num-seqs 256 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --dtype bfloat16 \
    --api-key <your-api-key>
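When the same deployment is launched from automation, it can help to keep the options in a dictionary and generate the argument list from it. The following is a minimal sketch under that assumption. `build_serve_command` is a hypothetical helper, not a vLLM API; it only reuses flag names that appear in the commands above.

```python
import shlex


def build_serve_command(model: str, options: dict) -> list[str]:
    """Turn a {option_name: value} mapping into a `vllm serve` argument list."""
    command = ["vllm", "serve", model]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            # Boolean switch: emit the flag without a value.
            command.append(flag)
        elif value is not False and value is not None:
            command.extend([flag, str(value)])
    return command


production = {
    "served_model_name": "qwen2.5-72b",
    "max_model_len": 8192,
    "max_num_seqs": 256,
    "tensor_parallel_size": 4,
    "gpu_memory_utilization": 0.9,
    "dtype": "bfloat16",
}
command = build_serve_command("/path/to/model", production)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command)` without going through a shell, which avoids quoting problems in model paths.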
Python API
import os
# Required: Set spawn method before importing vLLM
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
from vllm import LLM, SamplingParams
# Single NPU
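llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    dtype="bfloat16"
)
# Distributed inference (multi-NPU): an alternative to the single-NPU instance above,
# not an addition to it. Create only one LLM per process.
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192
)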
# Generate
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Hello world"], params)
LLM Engine (Advanced)
from vllm import LLMEngine, EngineArgs, SamplingParams
engine_args = EngineArgs(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096
)
engine = LLMEngine.from_engine_args(engine_args)
# Add requests and step through generation
request_id = "req-001"
prompt = "Hello, world!"
params = SamplingParams(max_tokens=50)
engine.add_request(request_id, prompt, params)
while engine.has_unfinished_requests():
    outputs = engine.step()
    for output in outputs:
        if output.finished:
            print(f"{output.request_id}: {output.outputs[0].text}")
Quantization
vLLM-Ascend supports models quantized with msModelSlim. For quantization details, see msmodelslim.
Using Quantized Models
# W8A8 quantized model
vllm serve /path/to/quantized-model-w8a8 \
    --quantization ascend \
    --max-model-len 4096
# W4A8 quantized model
vllm serve /path/to/quantized-model-w4a8 \
    --quantization ascend \
    --max-model-len 4096
Python API with Quantization
from vllm import LLM, SamplingParams
llm = LLM(
    model="/path/to/quantized-model",
    quantization="ascend",
    max_model_len=4096
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Hello"], params)
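To get a feel for what W8A8 and W4A8 buy in memory, here is a back-of-the-envelope sketch. It assumes the usual reading of the names (8-bit and 4-bit weights respectively) and counts weights only, ignoring quantization scales, activations, and the KV cache. The 7e9 parameter count is an illustrative round figure, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, weight_bits: int) -> float:
    """Approximate weight storage in GiB for a given bit width per weight."""
    return num_params * weight_bits / 8 / 2**30


NUM_PARAMS = 7e9  # illustrative round figure, not an exact model size

for label, bits in [("bfloat16", 16), ("W8A8", 8), ("W4A8", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Real checkpoints come out somewhat larger because some layers are typically left unquantized and the scales need storage too.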
Distributed Inference
Tensor Parallelism
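Shards the weight matrices of each layer across multiple NPUs, so every NPU holds a slice of every layer. Use it when a model does not fit in the memory of a single device.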
# 4-way tensor parallelism
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192
import os
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
from vllm import LLM
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192
)
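The idea behind tensor parallelism can be shown with a toy column-parallel linear layer in plain Python: each simulated device holds a block of weight columns, computes its part of the output, and the parts are concatenated. This is only an illustration of the principle. vLLM's actual kernels and collective communication are far more involved, and all names below are hypothetical.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Give each shard a contiguous, equal-width block of columns."""
    width = len(w[0]) // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# One weight shard per simulated device.
shards = split_columns(w, num_shards=2)
# Each device multiplies the full input by its own columns only.
partials = [matmul(x, shard) for shard in shards]
# Concatenating the partial outputs plays the role of an all-gather.
gathered = [value for part in partials for value in part]

print(gathered)
print(gathered == matmul(x, w))
```

Each device stores only `1 / num_shards` of the weights, which is why a higher `--tensor-parallel-size` lets larger models fit.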
Pipeline Parallelism
from vllm import LLM
llm = LLM(
    model="DeepSeek-V3",
    pipeline_parallel_size=2,
    tensor_parallel_size=4
)
Multi-Node Deployment
# NOTE: multi-node launch options differ between vLLM releases. Confirm the
# flag names below against `vllm serve --help` for your installed version.
# Node 0 (Rank 0)
vllm serve <model> \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-init-method "tcp://192.168.1.10:29500" \
    --distributed-rank 0
# Node 1 (Rank 1)
vllm serve <model> \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-init-method "tcp://192.168.1.10:29500" \
    --distributed-rank 1
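The two nodes together have to supply `tensor_parallel_size * pipeline_parallel_size` worker devices. The sketch below is a quick sanity check of that arithmetic, assuming one worker per NPU and the same number of NPUs on every node. `devices_per_node` is a hypothetical helper.

```python
def devices_per_node(tensor_parallel_size: int, pipeline_parallel_size: int, num_nodes: int) -> int:
    """Return how many NPUs each node must contribute to the deployment."""
    world_size = tensor_parallel_size * pipeline_parallel_size
    if world_size % num_nodes != 0:
        raise ValueError(
            f"world size {world_size} cannot be split evenly over {num_nodes} nodes"
        )
    return world_size // num_nodes


# Values from the two-node example: TP=8, PP=2 gives 16 workers in total.
print(devices_per_node(tensor_parallel_size=8, pipeline_parallel_size=2, num_nodes=2))
```

For the example above this comes to 8 NPUs per node, which matches a layout where each node hosts one pipeline stage.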
Performance Optimization
Key Parameters
| Parameter | Default | Description | Tuning Advice |
|---|---|---|---|
| `--max-model-len` | Model max | Maximum sequence length | Reduce if OOM |
| `--max-num-seqs` | 256 | Max concurrent sequences | Increase for throughput |
| `--gpu-memory-utilization` | 0.9 | Fraction of device (NPU) memory to use | Lower if OOM during warmup |
| `--dtype` | auto | Data type | bfloat16 for speed, float16 for compatibility |
| `--tensor-parallel-size` | 1 | Tensor parallelism degree | Use for large models |
| `--pipeline-parallel-size` | 1 | Pipeline parallelism degree | Use for very large models |
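Why `--max-model-len` and `--max-num-seqs` matter for memory becomes clearer with a rough KV cache estimate. The sketch below uses the standard formula of one key and one value vector per layer and KV head. The layer count, KV head count, and head dimension are illustrative assumptions; read the real values from your model's `config.json`. `kv_cache_bytes_per_token` is a hypothetical helper.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Illustrative model shape (an assumption, not taken from this document).
per_token = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)

# Worst case for one sequence that grows to the full --max-model-len of 4096.
per_sequence_mib = per_token * 4096 / 2**20

print(per_token)
print(per_sequence_mib)
```

Multiplying the per-sequence figure by `--max-num-seqs` gives an upper bound for the case where every sequence reaches full length, which is the budget that lowering either parameter shrinks.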
Example Configurations
# Small model (7B), single NPU
vllm serve <model> --max-model-len 4096 --max-num-seqs 256
# Medium model (32B), single NPU (only if the weights fit in device memory; otherwise add --tensor-parallel-size)
vllm serve <model> --max-model-len 8192 --max-num-seqs 128
# Large model (72B), multi-NPU
vllm serve <model> --tensor-parallel-size 4 --max-model-len 8192
# Maximum throughput
vllm serve <model> --max-num-seqs 512 --gpu-memory-utilization 0.95
Troubleshooting
Common Issues
Q: AclNN_Parameter_Error or dtype errors?
# Check NPU status and driver version
npu-smi info
# Ensure CANN >= 8.0.RC1
# Try different dtype
vllm serve <model> --dtype float16
Q: Out of Memory (OOM)?
# Reduce max model length
vllm serve <model> --max-model-len 2048
# Lower memory utilization
vllm serve <model> --gpu-memory-utilization 0.8
# Reduce concurrent sequences
vllm serve <model> --max-num-seqs 128
Q: Model loading fails?
# Check model path
ls /path/to/model
# Verify tokenizer
python -c "from transformers import AutoTokenizer; tok = AutoTokenizer.from_pretrained('/path/to/model'); print('OK')"
# Use trust_remote_code for custom models
vllm serve <model> --trust-remote-code
Q: Slow inference?
# Enable bfloat16 for faster compute
vllm serve <model> --dtype bfloat16
# Adjust block size
vllm serve <model> --block-size 256
# Enable prefix caching
vllm serve <model> --enable-prefix-caching
Q: API server connection refused?
# Check server is running
curl http://localhost:8000/health
# Verify port is not in use
lsof -i :8000
# Use explicit host/port
vllm serve <model> --host 0.0.0.0 --port 8000
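Connection errors right after startup often just mean the model is still loading. A small polling loop around the `/health` endpoint used above can tell the two cases apart. This is a sketch using only the standard library; `health_ok` and `wait_until_ready` are hypothetical helpers, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def health_ok(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the server answers the health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


def wait_until_ready(probe=health_ok, attempts: int = 30, delay_s: float = 2.0) -> bool:
    """Call probe() up to `attempts` times, sleeping between failed tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay_s)
    return False
```

Calling `wait_until_ready()` before sending the first request gives the server up to a minute to come up with these defaults; large models may need a longer budget.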
Environment Variables
# Required: Set multiprocessing method for vLLM-Ascend
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Set Ascend device IDs
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Debug logging
export VLLM_LOGGING_LEVEL=DEBUG
# Disable lazy initialization (for debugging)
export VLLM_ASCEND_LAZY_INIT=0
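When vLLM is driven from Python rather than a shell, the same variables can be set through `os.environ` before `vllm` is imported. The sketch below builds them from a list of device IDs. `ascend_env` is a hypothetical helper, and it only uses variable names that appear in the exports above.

```python
import os


def ascend_env(device_ids, debug: bool = False) -> dict:
    """Build the environment variables for a vLLM-Ascend process."""
    env = {
        "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
        "ASCEND_RT_VISIBLE_DEVICES": ",".join(str(d) for d in device_ids),
    }
    if debug:
        env["VLLM_LOGGING_LEVEL"] = "DEBUG"
    return env


# Apply the settings first; import vllm only after this point.
os.environ.update(ascend_env([0, 1, 2, 3]))
print(os.environ["ASCEND_RT_VISIBLE_DEVICES"])
```

Keeping the values in one function makes it harder to forget the spawn setting when a new entry point is added.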
Scripts
- scripts/benchmark_throughput.py - Throughput benchmark
- scripts/benchmark_latency.py - Latency benchmark
- scripts/start_server.sh - Server startup template
References
- references/deployment.md - Deployment patterns and best practices
- references/supported-models.md - Complete model support matrix
- references/api-reference.md - API endpoint documentation
Related Skills
- msmodelslim - Model quantization for vLLM-Ascend
- ascend-docker - Docker container setup for Ascend
- npu-smi - NPU device management
- hccl-test - HCCL performance testing for multi-NPU
Official References
- vLLM-Ascend Documentation: https://docs.vllm.ai/projects/ascend/en/latest/
- vLLM Documentation: https://docs.vllm.ai/
- Huawei Ascend: https://www.hiascend.com/document
- GitHub Repository: https://github.com/vllm-project/vllm-ascend