pytorch-quantization
Overview
Quantization converts high-precision floating-point tensors (FP32) into low-precision integers (INT8). This significantly reduces model size and improves inference speed on supported CPU backends such as FBGEMM (x86) and QNNPACK (ARM).
When to Use
Use quantization when deploying models to edge devices (mobile/IoT) or when seeking to reduce cloud inference costs by using INT8-optimized CPU instances.
Decision Tree
- Do you have a representative calibration dataset but no time for training?
- USE: Post-Training Quantization (PTQ).
- Is the accuracy drop from PTQ unacceptable?
- USE: Quantization Aware Training (QAT).
- Are you running on an ARM-based mobile device?
- SET: torch.backends.quantized.engine = 'qnnpack' (see the sketch below).
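A minimal sketch of selecting the quantized kernel backend; the available engine strings depend on how your PyTorch build was compiled, so treat 'qnnpack' here as an assumption about an ARM target rather than a universal default.

```python
import torch

# Engines compiled into this PyTorch build (typically includes 'fbgemm' and/or 'qnnpack').
print(torch.backends.quantized.supported_engines)

# 'qnnpack' targets ARM CPUs (mobile/IoT); 'fbgemm' targets x86 server CPUs.
torch.backends.quantized.engine = 'qnnpack'
```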
Workflows
- Using a Pre-Quantized Model (see the sketch below)
  - Select a quantized weight enum (e.g., ResNet50_QuantizedWeights.DEFAULT).
  - Instantiate the model with quantize=True.
  - Set the model to .eval() mode.
  - Apply the specific preprocessing transforms provided by the weights.
  - Perform inference using an INT8-optimized backend.
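A sketch of the pre-quantized workflow with TorchVision's INT8 ResNet-50; the weight enum and quantize=True call match the Evidence quote below, while the random input tensor is only a stand-in to show the call sequence (a quantized backend such as FBGEMM must be available on the host).

```python
import torch
from torchvision.models.quantization import resnet50, ResNet50_QuantizedWeights

# Pick the default INT8 weights and their matching preprocessing pipeline.
weights = ResNet50_QuantizedWeights.DEFAULT
preprocess = weights.transforms()

# quantize=True loads the INT8 model; quantized models are inference-only, so call .eval().
model = resnet50(weights=weights, quantize=True).eval()

# Stand-in for a real image (preprocess resizes, crops, and normalizes to 3x224x224).
batch = preprocess(torch.rand(3, 256, 256)).unsqueeze(0)

with torch.inference_mode():
    logits = model(batch)
print(weights.meta["categories"][logits.argmax().item()])
```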
- Manual Tensor Quantization (see the sketch below)
  - Determine the min/max range of your float tensor.
  - Calculate the scale and zero_point for the INT8 representation.
  - Apply torch.quantize_per_tensor() to the float input.
  - Perform operations on the quantized tensor and dequantize when necessary.
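A sketch of manual per-tensor quantization; the asymmetric (affine) mapping onto quint8 shown here is one common choice and an assumption about your data, not the only valid scheme.

```python
import torch

x = torch.randn(4, 4)

# Affine mapping of the observed [min, max] range onto the quint8 range [0, 255].
qmin, qmax = 0, 255
scale = (x.max() - x.min()).item() / (qmax - qmin)
zero_point = int(round(qmin - x.min().item() / scale))

xq = torch.quantize_per_tensor(x, scale=scale, zero_point=zero_point, dtype=torch.quint8)

print(xq.int_repr())        # underlying uint8 storage
print(xq.dequantize() - x)  # per-element quantization error
```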
- Post-Training Quantization Preparation (see the sketch below)
  - Fuse modules (e.g., Conv+BN+ReLU) into single blocks to improve efficiency.
  - Insert observers or use prepared models to collect activation statistics on a calibration dataset.
  - Convert the model using the backend-specific engine (e.g., 'fbgemm' for server CPUs).
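A sketch of eager-mode PTQ on a small hypothetical Conv-BN-ReLU module; fuse_modules, prepare, and convert are the standard torch.ao.quantization entry points, while the model, the module names passed to fuse_modules, and the random calibration loop are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, fuse_modules, get_default_qconfig, prepare, convert,
)

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # FP32 -> INT8 boundary
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()  # INT8 -> FP32 boundary

    def forward(self, x):
        return self.dequant(self.relu(self.bn(self.conv(self.quant(x)))))

model = TinyNet().eval()

# 1. Fuse Conv+BN+ReLU into a single block.
fuse_modules(model, [["conv", "bn", "relu"]], inplace=True)

# 2. Attach observers for the chosen backend ('fbgemm' for x86 server CPUs).
torch.backends.quantized.engine = "fbgemm"
model.qconfig = get_default_qconfig("fbgemm")
prepare(model, inplace=True)

# 3. Calibrate on representative data (random tensors stand in for a real dataset).
with torch.inference_mode():
    for _ in range(8):
        model(torch.randn(1, 3, 32, 32))

# 4. Swap observed modules for their quantized INT8 implementations.
convert(model, inplace=True)
```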
Non-Obvious Insights
- Backend Specificity: Pre-quantized models in TorchVision are optimized for specific backends. A model quantized for FBGEMM may perform poorly on QNNPACK.
- Per-Channel Accuracy: Per-channel quantization of weights is typically more accurate than per-tensor quantization because it accounts for the varying distributions across output channels (see the sketch after this list).
- Learning the Error: Quantization Aware Training (QAT) allows the model to learn and compensate for the quantization error during training, typically resulting in higher accuracy than post-training methods.
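A sketch illustrating the per-channel point above; the symmetric qint8 scaling is an assumed simplification, chosen so the reconstruction error of per-tensor and per-channel quantization can be compared directly.

```python
import torch

# Fake weight matrix: 4 output channels with very different magnitudes.
w = torch.randn(4, 16) * torch.tensor([[0.1], [1.0], [5.0], [20.0]])

# Per-tensor: a single symmetric scale, dominated by the largest channel.
scale = w.abs().max().item() / 127
q_tensor = torch.quantize_per_tensor(w, scale=scale, zero_point=0, dtype=torch.qint8)

# Per-channel: one symmetric scale per output channel (axis 0).
scales = w.abs().max(dim=1).values / 127
zero_points = torch.zeros(4, dtype=torch.int64)
q_channel = torch.quantize_per_channel(w, scales, zero_points, axis=0, dtype=torch.qint8)

# Per-channel reconstruction error is typically much smaller.
print("per-tensor  error:", (w - q_tensor.dequantize()).abs().mean().item())
print("per-channel error:", (w - q_channel.dequantize()).abs().mean().item())
```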
Evidence
- "resnet50(weights=weights, quantize=True)" (https://pytorch.org/vision/stable/models.html)
- "torch.quantize_per_tensor converts a float tensor to a quantized tensor with given scale and zero point." (https://pytorch.org/docs/stable/quantization.html)
Scripts
- scripts/pytorch-quantization_tool.py: Demo of manual tensor quantization and pre-quantized model loading.
- scripts/pytorch-quantization_tool.js: Node.js wrapper to invoke quantization conversion scripts.
Dependencies
- torch
- torchvision