Modal LLM Serving
Quick Start
- Verify the actual local Modal environment before writing code.
modal --version
python -c "import modal,sys; print(modal.__version__); print(sys.executable)"
modal profile current
- Do not assume the default `python` interpreter matches the environment behind the `modal` CLI.
- Confirm that the request is really about self-hosted open-weight text generation on Modal and fits one of these workloads:
  - Standard online API
  - Cold-start-sensitive vLLM deployment
  - Low-latency interactive serving
  - High-throughput or offline batch text inference
- Read references/performance-playbook.md and then exactly one primary reference:
  - Standard online API: references/vllm-online-serving.md
  - Cold starts: references/vllm-cold-starts.md
  - Low latency: references/sglang-low-latency.md
  - Throughput or batch inference: references/vllm-throughput.md
- Default to vLLM plus `@modal.web_server` unless the user explicitly optimizes for lowest latency or offline throughput.
- Ground every implementation in the actual workload: target latency or throughput, model size and precision, GPU type and count, region, concurrency target, and cold-start tolerance.
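The environment check in the first step can also be done from inside Python. The sketch below uses only the standard library; the function name `describe_modal_environment` and the dictionary keys are illustrative, not part of Modal. It reports which interpreter is running, where the `modal` package would be imported from, and which `modal` executable is first on `PATH`, so a mismatch between the interpreter and the CLI is easy to spot.

```python
import importlib.util
import shutil
import sys


def describe_modal_environment() -> dict:
    """Report the running interpreter and where `modal` resolves from.

    Hypothetical helper: the name and the returned keys are illustrative.
    """
    spec = importlib.util.find_spec("modal")
    return {
        # Interpreter executing this script; the CLI may use a different one.
        "python_executable": sys.executable,
        # Import location of the package, or None if it is not installed here.
        "modal_package": spec.origin if spec is not None else None,
        # First `modal` executable on PATH, or None if the CLI is not found.
        "modal_cli": shutil.which("modal"),
    }


if __name__ == "__main__":
    for key, value in describe_modal_environment().items():
        print(f"{key}: {value}")
```

If `modal_package` is `None` while `modal_cli` is set, the CLI most likely lives in a different environment from the interpreter that ran the script.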
Choose the Workflow
- Use vLLM with `@modal.web_server` for the default OpenAI-compatible serving path. Read references/vllm-online-serving.md.
- Use vLLM with memory snapshots and a sleep or wake flow only when cold-start latency is a first-class requirement. Read references/vllm-cold-starts.md.
- Use SGLang with `modal.experimental.http_server`, explicit region selection, and sticky routing when the user cares most about latency. Read references/sglang-low-latency.md.
- Use the vLLM Python `LLM` interface inside `@app.cls` or another batch worker when the task is about tokens per second or tokens per dollar rather than per-request HTTP behavior. Read references/vllm-throughput.md.
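The four choices above amount to a lookup from the primary objective to an engine, an interface, and one reference file. A minimal sketch of that lookup follows. The objective labels (`online_api`, `cold_start`, `low_latency`, `throughput`) and the `Workflow` type are hypothetical names introduced for illustration; only the engines, interfaces, and reference paths come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    engine: str
    interface: str
    reference: str


# Keys are illustrative labels; values restate the workflow list above.
WORKFLOWS = {
    "online_api": Workflow(
        "vLLM", "@modal.web_server", "references/vllm-online-serving.md"
    ),
    "cold_start": Workflow(
        "vLLM",
        "memory snapshots with a sleep or wake flow",
        "references/vllm-cold-starts.md",
    ),
    "low_latency": Workflow(
        "SGLang",
        "modal.experimental.http_server",
        "references/sglang-low-latency.md",
    ),
    "throughput": Workflow(
        "vLLM",
        "Python LLM interface in @app.cls",
        "references/vllm-throughput.md",
    ),
}


def choose_workflow(objective: str = "online_api") -> Workflow:
    """Return the workflow for an objective; the default is the online API path."""
    try:
        return WORKFLOWS[objective]
    except KeyError:
        known = ", ".join(sorted(WORKFLOWS))
        raise ValueError(
            f"unknown objective {objective!r}; expected one of: {known}"
        ) from None
```

Defaulting to `online_api` mirrors the Quick Start rule: vLLM plus `@modal.web_server` unless the user explicitly optimizes for lowest latency or offline throughput.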
Default Rules
- Pin model revisions when pulling from Hugging Face or another mutable registry.
- Cache model weights in one Modal Volume and engine compilation artifacts in another when the runtime produces them.
- Set `HF_XET_HIGH_PERFORMANCE=1` for Hub downloads unless the environment has a specific reason not to.
- Include a readiness check before reporting success. Add a smoke-test path such as a `local_entrypoint` or a small client that hits `/health` and one OpenAI-compatible request.
- Treat `max_inputs`, `target_inputs`, tensor parallelism, and related knobs as workload-specific. Start conservative and benchmark before increasing them.
- Expose only the ports and endpoints the task actually needs.
- Keep one serving pattern per file unless the user explicitly asks for a comparison artifact or benchmark harness.
- Use SGLang only when lowest latency is the explicit objective and the extra setup is justified.
- Use the vLLM Python `LLM` interface only for offline or batch inference that does not need per-request HTTP behavior.
- Use snapshot-based cold-start reduction only when startup latency matters enough to justify extra operational complexity.
- Keep the scope on self-hosted text generation engines. Do not stretch this skill to cover embeddings, generic `transformers` pipelines, diffusion inference, or purely hosted-API usage.
- If the task is really about training or post-training, stop and use `modal-finetuning`.
- If the task is really about detached job orchestration, retries, or `.map` and `.spawn`, stop and use `modal-batch-processing`.
- If the task is really about isolated interactive execution or sandbox lifecycle, stop and use `modal-sandbox`.
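Two of the rules above, pinning the model revision and checking readiness before reporting success, can be sketched with the standard library alone. The helper names (`build_vllm_command`, `http_ok`, `wait_until_ready`), the default port of 8000, and the timeout values are assumptions made for illustration. The sketch also assumes the server is started with `vllm serve` and exposes `/health`. In a real deployment the command would be launched inside the Modal function that serves the port.

```python
import http.client
import time
import urllib.request
from typing import Callable

# Environment for Hub downloads, per the rule above.
SERVING_ENV = {"HF_XET_HIGH_PERFORMANCE": "1"}


def build_vllm_command(
    model: str, revision: str, port: int = 8000, tensor_parallel_size: int = 1
) -> list[str]:
    """Build a `vllm serve` argv with an explicitly pinned model revision."""
    return [
        "vllm", "serve", model,
        "--revision", revision,  # pin: never rely on a mutable default branch
        "--host", "0.0.0.0",
        "--port", str(port),
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5.0) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and non-2xx responses all mean "not ready".
        return False


def wait_until_ready(
    base_url: str,
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `<base_url>/health` until it succeeds or `timeout_s` elapses."""
    health_url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(health_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `probe` and `sleep` parameters exist so the polling logic can be exercised without a running server. A smoke test would call `wait_until_ready` first and only then send one OpenAI-compatible request.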
Validate
- Run `npx skills add . --list` after editing the package metadata or skill descriptions.
- Keep `evals/evals.json` and `evals/trigger-evals.json` aligned with the actual workflow boundary of the skill.
- Run `python3 -m py_compile skills/modal-llm-serving/scripts/qwen3_throughput.py` when changing the throughput artifact.
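The compile check in the last step can also be run from Python, which is convenient when more than one script needs checking. This is a minimal sketch built on the standard-library `py_compile` module; the function name `compile_check` is illustrative, and the path in the usage guard is the one named above.

```python
import py_compile
from pathlib import Path


def compile_check(paths) -> dict:
    """Byte-compile each file and return {path: error message} for failures.

    A missing file raises FileNotFoundError rather than being reported.
    """
    failures = {}
    for path in paths:
        try:
            # doraise=True turns syntax errors into PyCompileError exceptions.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures[str(path)] = exc.msg
    return failures


if __name__ == "__main__":
    script = Path("skills/modal-llm-serving/scripts/qwen3_throughput.py")
    problems = compile_check([script])
    for name, message in problems.items():
        print(f"{name}: {message}")
    raise SystemExit(1 if problems else 0)
```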
References
- Read references/performance-playbook.md first for objective-setting, engine selection, and tuning priorities.
- Read references/vllm-online-serving.md for the default HTTP serving path.
- Read references/vllm-cold-starts.md only when cold-start reduction is worth snapshot complexity.
- Read references/sglang-low-latency.md only when the task explicitly optimizes for low latency.
- Read references/vllm-throughput.md only when the workload is throughput or batch oriented.
- Read references/example-patterns.md for compact serving templates and adaptation paths.
- Read references/troubleshooting.md for common serving failures and recovery steps.
- Adapt scripts/qwen3_throughput.py for throughput-oriented batch workloads with a pinned model and local benchmark entrypoint.