club-3090-llm-serving
Installation
SKILL.md
club-3090 LLM Serving
Skill by ara.so — Daily 2026 Skills collection.
Community recipes for serving modern LLMs on RTX 3090 (24 GB) hardware. Supports vLLM, llama.cpp, and SGLang engines with validated Docker Compose configs exposing an OpenAI-compatible API on localhost:8020. Currently ships Qwen3.6-27B configs for 1× and 2× cards.
Engine Decision Matrix
| Need | Engine | Why |
|---|---|---|
| Max throughput (code/chat) | vLLM dual | 89–127 TPS, MTP n=3, vision, tools |
| Full 262K context, no crashes | llama.cpp single | No prefill cliffs, stable tool-use |
| 4 concurrent streams @ 262K | vLLM dual turbo | Stream isolation, full feature stack |
| Single card, moderate ctx | vLLM default | ~89 TPS, easiest setup |
SGLang is currently blocked on Qwen3.6-27B — see models/qwen3.6-27b/sglang/README.md.