torchaudio
Overview
TorchAudio provides signal processing tools for PyTorch, enabling users to treat audio processing as part of the neural network graph. This allows transforms to run on GPUs and to be chained into nn.Sequential pipelines.
When to Use
Use TorchAudio for converting raw audio waveforms into features like Mel Spectrograms, performing data augmentation (SpecAugment), or when high-performance resampling is required.
Decision Tree
- Do you need to transform many audio files quickly?
  - MOVE: The transform module to GPU using `.to('cuda')`.
- Are you training an Automatic Speech Recognition (ASR) model?
  - USE: SpecAugment (TimeMasking, FrequencyMasking) on the spectrogram.
- Do you need to align text to audio?
  - USE: The `forced_align` functional API with a Wav2Vec2 model.
Workflows
- Audio Feature Extraction Pipeline
  - Load the waveform as a PyTorch tensor.
  - Apply `Resample` to match the target frequency (e.g., 16000 Hz).
  - Use `Spectrogram` to compute a power/amplitude spectrogram.
  - Convert the spectrogram to mel-scale using `MelScale`.
  - Optionally apply `AmplitudeToDB` to get decibel values.
- GPU-Accelerated Data Augmentation
  - Define an augmentation pipeline using `nn.Sequential` (e.g., `TimeStretch`, `FrequencyMasking`).
  - Move the pipeline to the GPU (`.to('cuda')`).
  - Process batches of spectrograms directly on the device to avoid CPU bottlenecks.
- Pitch and Speed Perturbation
  - Use `torchaudio.transforms.PitchShift` for pitch changes without affecting duration.
  - Apply `torchaudio.transforms.Speed` for speed changes (which also affect pitch).
  - Iterate through a range of factors to create a diverse augmented dataset for speech recognition.
Non-Obvious Insights
- GPU Efficiency: Moving audio transformations like `MelScale` to the GPU allows the entire feature extraction process to happen in parallel with model training, eliminating CPU data-loading bottlenecks.
- ASR Robustness: SpecAugment techniques (Time/Frequency Masking) should be applied to the spectrogram, not the raw waveform, as they are designed to simulate missing acoustic information in the frequency domain.
- Library Evolution: Starting with version 2.8, TorchAudio has entered maintenance mode, with some low-level decoding capabilities consolidated into the `TorchCodec` project.
Evidence
- "Transforms are implemented using torch.nn.Module... common ways to build a processing pipeline are to chain Modules together using torch.nn.Sequential." (https://pytorch.org/audio/stable/transforms.html)
- "PitchShift: Shift the pitch of a waveform by n_steps steps." (https://pytorch.org/audio/stable/transforms.html)
Scripts
- scripts/torchaudio_tool.py: Utility for building an audio feature pipeline.
- scripts/torchaudio_tool.js: Node.js wrapper for processing audio files via TorchAudio.
Dependencies
- torchaudio
- torch