skills/yonatangross/skillforge-claude-plugin/high-performance-inference

high-performance-inference

SKILL.md

High-Performance Inference

Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.

vLLM 0.14.0 (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.

Overview

  • Deploying LLMs with low latency requirements
  • Reducing GPU memory for larger models
  • Maximizing throughput for batch inference
  • Edge/mobile deployment with constrained resources
  • Cost optimization through efficient hardware utilization

Quick Reference

# Basic vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192

# With quantization + speculative decoding
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --quantization awq \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9

vLLM 0.14.x Key Features

Feature Benefit
PagedAttention Up to 24x throughput via efficient KV cache
Continuous Batching Dynamic request batching for max utilization
CUDA Graphs Fast model execution with graph capture
Tensor Parallelism Scale across multiple GPUs
Prefix Caching Reuse KV cache for shared prefixes
AttentionConfig New API replacing VLLM_ATTENTION_BACKEND env
Semantic Router vLLM SR v0.1 "Iris" for intelligent LLM routing

Python vLLM Integration

from vllm import LLM, SamplingParams

# Initialize with optimization flags
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Quantization Methods

Method Bits Memory Savings Speed Quality
FP16 16 Baseline Baseline Best
INT8 8 50% +10-20% Very Good
AWQ 4 75% +20-40% Good
GPTQ 4 75% +15-30% Good
FP8 8 50% +30-50% Very Good

When to Use Each:

  • FP16: Maximum quality, sufficient memory
  • INT8/FP8: Balance of quality and efficiency
  • AWQ: Best 4-bit quality, activation-aware
  • GPTQ: Faster quantization, good quality

Speculative Decoding

Accelerate generation by predicting multiple tokens:

# N-gram based (no extra model)
speculative_config = {
    "method": "ngram",
    "num_speculative_tokens": 5,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}

# Draft model (higher quality)
speculative_config = {
    "method": "draft_model",
    "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 3,
}

Expected Gains: 1.5-2.5x throughput for autoregressive tasks.

Key Decisions

Decision Recommendation
Quantization AWQ for 4-bit, FP8 for H100/H200
Batch size Dynamic via continuous batching
GPU memory 0.85-0.95 utilization
Parallelism Tensor parallel across GPUs
KV cache Enable prefix caching for shared contexts

Common Mistakes

  • Using GPTQ without calibration data (poor quality)
  • Over-allocating GPU memory (OOM on peak loads)
  • Ignoring warmup requests (cold start latency)
  • Not benchmarking actual workload patterns
  • Mixing quantization with incompatible features

Performance Benchmarking

from vllm import LLM, SamplingParams
import time

def benchmark_throughput(llm, prompts, sampling_params, num_runs=3):
    """Benchmark tokens per second."""
    total_tokens = 0
    total_time = 0

    for _ in range(num_runs):
        start = time.perf_counter()
        outputs = llm.generate(prompts, sampling_params)
        elapsed = time.perf_counter() - start

        tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        total_tokens += tokens
        total_time += elapsed

    return total_tokens / total_time  # tokens/sec

Advanced Patterns

See references/ for:

  • vLLM Deployment: PagedAttention, batching, production config
  • Quantization Guide: AWQ, GPTQ, INT8, FP8 comparison
  • Speculative Decoding: Draft models, n-gram, throughput tuning
  • Edge Deployment: Mobile, resource-constrained optimization

Related Skills

  • llm-streaming - Streaming token responses
  • function-calling - Tool use with inference
  • ollama-local - Local inference with Ollama
  • prompt-caching - Reduce redundant computation
  • semantic-caching - Cache full responses

Capability Details

vllm-deployment

Keywords: vllm, inference server, deploy, serve, production Solves:

  • Deploy LLMs with vLLM for production
  • Configure tensor parallelism and batching
  • Optimize GPU memory utilization

quantization

Keywords: quantize, AWQ, GPTQ, INT8, FP8, compress, reduce memory Solves:

  • Reduce model memory footprint
  • Choose appropriate quantization method
  • Maintain quality with lower precision

speculative-decoding

Keywords: speculative, draft model, faster generation, predict tokens Solves:

  • Accelerate autoregressive generation
  • Configure draft models or n-gram speculation
  • Tune speculative token count

edge-inference

Keywords: edge, mobile, embedded, constrained, optimization Solves:

  • Deploy on resource-constrained devices
  • Optimize for mobile/edge hardware
  • Balance quality and resource usage

throughput-optimization

Keywords: throughput, latency, performance, benchmark, optimize Solves:

  • Maximize requests per second
  • Reduce time to first token
  • Benchmark and tune performance
Weekly Installs
4
GitHub Stars
95
First Seen
Jan 21, 2026
Installed on
claude-code3
opencode2
antigravity2
gemini-cli2
windsurf1
trae1