High-Performance Inference

Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.

vLLM 0.14.0 (Jan): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API; Python 3.12+ recommended.

Overview

Typical use cases:

  • Deploying LLMs with low latency requirements
  • Reducing GPU memory for larger models
  • Maximizing throughput for batch inference
  • Edge/mobile deployment with constrained resources
  • Cost optimization through efficient hardware utilization
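
As a rough illustration of the memory point above, the following sketch estimates how much GPU memory the model weights alone would occupy at different precisions. It is a back-of-the-envelope calculation, not part of the original skill: the function name `weight_memory_gib` and the 7B parameter count are hypothetical, and the estimate ignores KV cache, activations, and any per-group scale/zero-point overhead that real quantization formats add.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits,
    with no quantization metadata, KV cache, or activation memory included.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Hypothetical 7B-parameter model, used purely for illustration.
PARAMS_7B = 7e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gib(PARAMS_7B, bits):.1f} GiB")
# FP16: 13.0 GiB
# INT8: 6.5 GiB
# INT4: 3.3 GiB
```

Actual savings depend on the quantization scheme and on how much memory the serving engine reserves for the KV cache, so treat these numbers as a lower bound on weight storage rather than a total footprint.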

Quick Reference

First Seen
Jan 22, 2026
high-performance-inference — yonatangross/orchestkit