torchserve
Overview
TorchServe is a flexible and easy-to-use tool for serving PyTorch models. It provides capabilities for packaging models, scaling workers based on hardware availability, and managing multiple model versions via a REST/gRPC API.
When to Use
Use TorchServe when you need a production-ready inference server that handles multi-GPU load balancing, request batching, and custom preprocessing/postprocessing logic via Python handlers.
Decision Tree
- Do you need custom logic for image resizing or JSON parsing before model inference?
  - OVERRIDE: `preprocess()` in a class inheriting from `BaseHandler`.
- Do you have multiple GPUs available?
  - RELY: On TorchServe's round-robin assignment; check the `gpu_id` in the handler context.
- Do you want to deploy to a system with limited resources?
  - CAUTION: TorchServe is in limited maintenance mode; check environment compatibility.
Workflows
- Packaging and Serving a Model
  - Write a custom handler or use a default one (e.g., `image_classifier`).
  - Use `torch-model-archiver` to package the model, weights, and handler into a `.mar` file.
  - Start TorchServe, specifying the model store and the initial models to load.
  - Test the endpoint using `curl` or a gRPC client (see the smoke test below).
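A minimal smoke test in Python, assuming the `densenet161` archive from the Evidence quote is registered and TorchServe is listening on its default inference port (8080); the image path is a placeholder:

```python
# Smoke test against a running TorchServe instance.
# Assumptions: model registered as "densenet161", default inference port 8080.
import requests

MODEL_NAME = "densenet161"  # must match --model-name used at archive time
URL = f"http://localhost:8080/predictions/{MODEL_NAME}"

with open("kitten.jpg", "rb") as f:  # placeholder sample image
    resp = requests.post(URL, data=f.read(), timeout=30)

resp.raise_for_status()
print(resp.json())  # e.g., class/probability pairs from the image_classifier handler
```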
- Customizing Inference Logic
  - Define a class inheriting from `BaseHandler`.
  - Override `preprocess()` to handle incoming JSON/image data.
  - Override `inference()` or `postprocess()` to customize output formatting.
  - Package this script as the `--handler` in the model archiver (a skeleton follows below).
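A hedged skeleton of such a handler. `BaseHandler` and the `preprocess`/`postprocess` hooks are TorchServe's; the request payload layout (a "data" key per request) and the argmax postprocessing are illustrative assumptions:

```python
# Custom handler sketch; package with torch-model-archiver --handler my_handler.py
import json

import torch
from ts.torch_handler.base_handler import BaseHandler


class MyJSONHandler(BaseHandler):
    def preprocess(self, data):
        # `data` is a list of requests; each row carries raw bytes under
        # "data" or "body".
        batch = []
        for row in data:
            payload = row.get("data") or row.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = json.loads(payload)
            batch.append(payload["data"])  # assumed JSON field name
        # self.device is set by BaseHandler.initialize()
        return torch.as_tensor(batch, device=self.device)

    def postprocess(self, inference_output):
        # Return one JSON-serializable entry per request in the batch.
        return inference_output.argmax(dim=1).tolist()
```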
- Scaling Inference Capacity
  - Use the Management API (typically on port 8081) to adjust the number of workers.
  - Send a `PUT` request to `/models/{model_name}?min_worker=N` (see the sketch below).
  - Monitor logs to ensure new workers are successfully initialized on the available hardware.
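A sketch of that call with Python's `requests`, assuming the default management port 8081 and a registered model named `densenet161`:

```python
# Scale a model's worker pool via the Management API (default port 8081).
import requests

BASE = "http://localhost:8081/models/densenet161"  # placeholder model name

resp = requests.put(BASE, params={"min_worker": 4}, timeout=30)
resp.raise_for_status()
print(resp.json())  # status message while workers are (de)provisioned

# Verify: the model description lists each worker and its assigned device.
print(requests.get(BASE, timeout=30).json())
```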
Non-Obvious Insights
- A/B Testing: TorchServe serves multiple model versions simultaneously, so A/B testing reduces to routing requests to version-specific endpoints (e.g., `/predictions/{model_name}/{version}`).
- GPU Round-Robin: Workers are assigned GPUs in round-robin fashion. Handlers must use the `gpu_id` provided in the `context` to ensure the model is loaded onto the correct physical device (see the sketch below).
- The MAR Format: The Model Archive (`.mar`) file is a self-contained ZIP that bundles the model definition, state dictionary, and handler script, so the artifacts served in production are exactly the ones packaged during development.
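A sketch of honoring that assignment inside `initialize()`. Reading `gpu_id` from `context.system_properties` follows the quoted docs; the `model.pt` artifact name is a placeholder for whatever was archived:

```python
# Device-aware initialization, mirroring the round-robin gpu_id contract.
import os

import torch
from ts.torch_handler.base_handler import BaseHandler


class DeviceAwareHandler(BaseHandler):
    def initialize(self, context):
        props = context.system_properties
        gpu_id = props.get("gpu_id")  # id chosen round-robin by TorchServe
        self.device = torch.device(
            f"cuda:{gpu_id}"
            if torch.cuda.is_available() and gpu_id is not None
            else "cpu"
        )
        # Load the archived weights onto the assigned device
        # ("model.pt" is a placeholder artifact name).
        model_path = os.path.join(props.get("model_dir"), "model.pt")
        self.model = torch.jit.load(model_path, map_location=self.device)
        self.model.eval()
        self.initialized = True
```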
Evidence
- "Archive the model by using the model archiver: torch-model-archiver --model-name densenet161 --version 1.0..." (https://pytorch.org/serve/getting_started.html)
- "In case of multiple GPUs TorchServe selects the gpu device in round-robin fashion and passes on this device id to the model handler in context." (https://pytorch.org/serve/custom_service.html)
Scripts
- `scripts/torchserve_tool.py`: Skeleton for a custom TorchServe handler.
- `scripts/torchserve_tool.js`: Script to send inference requests to a running TorchServe instance.
Dependencies
- torchserve
- torch-model-archiver