torchtext
Overview
TorchText is a legacy NLP library for PyTorch. Although it is in maintenance mode, it remains a common tool for loading classic NLP datasets through DataPipes and for building vocabularies.
When to Use
Use TorchText for maintaining legacy NLP projects or when utilizing its built-in DataPipe-based datasets. For new projects, transitioning to native PyTorch or other modern NLP libraries is recommended.
Decision Tree
- Are you starting a new NLP project?
  - CONSIDER: Using Hugging Face or native PyTorch instead of TorchText.
- Do you need a high-performance tokenizer for production?
  - USE: RegexTokenizer and compile it with torch.jit.script.
- Are you using DataPipes with multiple workers?
  - ENSURE: Use a proper worker_init_fn in the DataLoader to avoid data duplication.
Workflows
- Building a Text Processing Pipeline
  - Initialize a tokenizer (e.g., BERTTokenizer).
  - Construct a Vocab object using build_vocab_from_iterator from a dataset.
  - Create a pipeline using transforms.Sequential containing: Tokenizer -> VocabTransform -> AddToken -> Truncate -> ToTensor.
  - Pass raw strings through the pipeline to get padded tensors.
- Using Built-in NLP Datasets
  - Import a dataset from torchtext.datasets (e.g., IMDB, AG_NEWS).
  - Initialize the DataPipe for the desired split ('train', 'test').
  - Set up a DataLoader with shuffle=True and a proper worker_init_fn.
  - Iterate through the DataPipe to get (label, text) pairs.
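The dataset workflow above can be sketched without a network download by substituting a tiny in-memory DataPipe for torchtext.datasets.IMDB (an assumption made purely so the example runs anywhere):

```python
from torch.utils.data import DataLoader
from torch.utils.data.datapipes.iter import IterableWrapper

# Stand-in for torchtext.datasets.IMDB(split="train"): a small in-memory
# DataPipe of (label, text) pairs (assumption for illustration).
pairs = [(1, "great film"), (2, "dull plot"), (1, "loved it"), (2, "too long")]
dp = IterableWrapper(pairs).shuffle().sharding_filter()

# num_workers=0 keeps this sketch dependency-free; with num_workers > 0 you
# must also pass a sharding-aware worker_init_fn so each worker serves a
# distinct shard instead of duplicating the data.
loader = DataLoader(dp, batch_size=2, num_workers=0)

seen = 0
for labels, texts in loader:  # labels is a tensor, texts a tuple of strings
    seen += len(texts)
print(seen)  # 4
```

The sharding_filter() call is what worker initialization hooks into: it marks the point in the graph where each worker keeps only its own slice of the stream.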
- Custom Regex Tokenization
  - Define a list of regex patterns and their replacements.
  - Instantiate RegexTokenizer with the patterns.
  - Optionally use torch.jit.script to compile the tokenizer for production.
  - Apply the tokenizer to raw strings to generate tokens.
Non-Obvious Insights
- Maintenance Status: Development of TorchText stopped as of April 2024 (v0.18), marking it as a legacy library.
- Data Duplication Risk: DataPipe-based datasets require explicit worker initialization in the DataLoader to ensure that multiple workers don't serve the same data shards.
- Inference Speed: Many transforms, such as BERTTokenizer, are implemented in TorchScript, allowing high-performance inference without a full Python runtime.
Evidence
- "Warning TorchText development is stopped and the 0.18 release (April 2024) will be the last stable release." (https://pytorch.org/text/stable/index.html)
- "RegexTokenizer: Regex tokenizer for a string sentence that applies all regex replacements... backed by the C++ RE2 engine." (https://pytorch.org/text/stable/transforms.html)
Scripts
- scripts/torchtext_tool.py: Example of building a vocabulary and tokenizer pipeline.
- scripts/torchtext_tool.js: Node.js interface for invoking TorchText pipelines.
Dependencies
- torchtext
- torch
- torchdata