High-Performance Inference
Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.
vLLM 0.14.0 (Jan): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API; Python 3.12+ recommended.
Overview
Use this skill when:
- Deploying LLMs under strict latency requirements
- Reducing GPU memory use to fit larger models
- Maximizing throughput for batch inference
- Deploying to edge or mobile devices with constrained resources
- Cutting cost through more efficient hardware utilization
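
To make the memory use case concrete, here is a rough back-of-the-envelope sketch of how weight precision affects GPU memory. It is an illustration only: the helper name `weight_memory_gb` and the 70B parameter count are hypothetical, and the estimate covers model weights alone, ignoring KV cache, activations, and the per-group scale/zero-point overhead that real quantization formats add.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in decimal GB (1e9 bytes).

    Assumption: every parameter is stored at the same bit width and
    quantization metadata (scales, zero points) is ignored.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 70e9  # hypothetical 70B-parameter model
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{label}: ~{weight_memory_gb(params, bits):.0f} GB for weights")
```

Under these assumptions, halving the bit width halves the weight footprint, which is why quantization is usually the first lever when a model does not fit on the available GPUs. Actual usage will be higher once the KV cache and runtime buffers are included.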