# HuggingFace Transformers
Access thousands of pre-trained models for NLP, vision, audio, and multimodal tasks.
## When to Use
- Quick inference with pipelines
- Text generation, classification, QA, NER
- Image classification, object detection
- Fine-tuning on custom datasets
- Loading pre-trained models from HuggingFace Hub
## Pipeline Tasks

### NLP Tasks

| Task | Pipeline Name | Output |
|---|---|---|
| Text Generation | text-generation | Completed text |
| Classification | text-classification | Label + confidence |
| Question Answering | question-answering | Answer span |
| Summarization | summarization | Shorter text |
| Translation | translation_en_to_fr | Translated text |
| NER | ner | Entity spans + types |
| Fill Mask | fill-mask | Predicted tokens |
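The same `pipeline()` call covers every task above; only the task string changes. A minimal sketch, assuming illustrative checkpoints (gpt2 and the default NER model) that download on first use:

```python
from transformers import pipeline

# Text generation (model name is illustrative; omit it to use the task default)
generator = pipeline("text-generation", model="gpt2")
print(generator("The HuggingFace library", max_new_tokens=30)[0]["generated_text"])

# Named entity recognition with grouped entity spans
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Hugging Face is based in New York City."))
```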
### Vision Tasks

| Task | Pipeline Name | Output |
|---|---|---|
| Image Classification | image-classification | Label + confidence |
| Object Detection | object-detection | Bounding boxes |
| Image Segmentation | image-segmentation | Pixel masks |
### Audio Tasks

| Task | Pipeline Name | Output |
|---|---|---|
| Speech Recognition | automatic-speech-recognition | Transcribed text |
| Audio Classification | audio-classification | Label + confidence |
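Vision and audio pipelines follow the same pattern; inputs can be file paths, URLs, or in-memory images/waveforms. A quick sketch with hypothetical file names (cat.jpg, speech.wav) and the default checkpoints:

```python
from transformers import pipeline

# Image classification: accepts a local path, URL, or PIL image
image_clf = pipeline("image-classification")
print(image_clf("cat.jpg"))        # hypothetical local file

# Speech recognition: accepts an audio file path or a raw waveform
asr = pipeline("automatic-speech-recognition")
print(asr("speech.wav")["text"])   # hypothetical local file
```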
## Model Loading Patterns

### Auto Classes

| Class | Use Case |
|---|---|
| AutoModel | Base model (embeddings) |
| AutoModelForCausalLM | Text generation (GPT-style) |
| AutoModelForSeq2SeqLM | Encoder-decoder (T5, BART) |
| AutoModelForSequenceClassification | Classification head |
| AutoModelForTokenClassification | NER, POS tagging |
| AutoModelForQuestionAnswering | Extractive QA |
Key concept: Prefer the Auto classes unless you need a specific architecture; they infer the correct model class from the checkpoint's config.
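A typical loading pattern, shown here with gpt2 as an illustrative checkpoint; any causal-LM checkpoint from the Hub works the same way:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello, world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```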
## Generation Parameters

| Parameter | Effect | Typical Values |
|---|---|---|
| max_new_tokens | Output length | 50-500 |
| temperature | Randomness (lower = more deterministic) | 0.1-1.0 |
| top_p | Nucleus sampling threshold | 0.9-0.95 |
| top_k | Limit vocabulary per step | 50 |
| num_beams | Beam search width (used with sampling disabled) | 4-8 |
| repetition_penalty | Discourage repetition | 1.1-1.3 |
Key concept: Higher temperature = more creative but less coherent. For factual tasks, use low temperature (0.1-0.3).
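Continuing the sketch above (same model and inputs), the parameters map directly onto generate() keyword arguments; do_sample=True is needed for temperature, top_p, and top_k to take effect:

```python
# Sampling-based generation (more creative, less deterministic)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    repetition_penalty=1.2,
)

# Beam search (sampling disabled) for more factual or structured output
outputs = model.generate(**inputs, max_new_tokens=100, num_beams=4, do_sample=False)
```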
## Memory Management

### Device Placement Options

| Option | When to Use |
|---|---|
| device_map="auto" | Let the library decide GPU/CPU allocation |
| device_map="cuda:0" | Specific GPU |
| device_map="cpu" | CPU only |
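A sketch of device placement at load time; the checkpoint name is illustrative, and device_map requires the accelerate package:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative checkpoint
    device_map="auto",           # spread layers across available GPUs (and CPU if needed)
    torch_dtype="auto",          # use the precision stored in the checkpoint
)
```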
### Quantization Options

| Method | Memory Reduction | Quality Impact |
|---|---|---|
| 8-bit | ~50% | Minimal |
| 4-bit | ~75% | Small for most tasks |
| GPTQ | ~75% | Small; requires a calibration dataset |
| AWQ | ~75% | Small; uses activation-aware calibration |
Key concept: Use torch_dtype="auto" to automatically use the model's native precision (often bfloat16).
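A 4-bit loading sketch using BitsAndBytesConfig; it assumes the bitsandbytes package and a CUDA GPU, and the checkpoint name is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization for the weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/quality
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```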
## Fine-Tuning Concepts

### Trainer Arguments

| Argument | Purpose | Typical Value |
|---|---|---|
| num_train_epochs | Training passes | 3-5 |
| per_device_train_batch_size | Samples per GPU | 8-32 |
| learning_rate | Step size | 2e-5 for fine-tuning |
| weight_decay | Regularization | 0.01 |
| warmup_ratio | LR warmup | 0.1 |
| evaluation_strategy | When to eval | "epoch" or "steps" |
### Fine-Tuning Strategies

| Strategy | Memory | Quality | Use Case |
|---|---|---|---|
| Full fine-tuning | High | Best | Small models, enough data |
| LoRA | Low | Good | Large models, limited GPU |
| QLoRA | Very low | Good | 7B+ models on consumer GPU |
| Prefix tuning | Low | Moderate | When you can't modify weights |
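LoRA and QLoRA come from the separate peft library rather than transformers itself. A sketch of wrapping an already-loaded model with LoRA adapters; the target_modules names vary by architecture, and combining this with the 4-bit loading shown earlier gives the QLoRA setup:

```python
from peft import LoraConfig, get_peft_model  # requires the peft package

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections; names depend on the model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # `model` loaded as in the earlier sketches
model.print_trainable_parameters()          # typically well under 1% of all parameters
```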
## Tokenization Concepts

| Parameter | Purpose |
|---|---|
| padding | Make sequences the same length |
| truncation | Cut sequences to max_length |
| max_length | Maximum tokens (model-specific) |
| return_tensors | Output format ("pt", "tf", "np") |
Key concept: Always use the tokenizer that matches the model—different models use different vocabularies.
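A tokenization sketch showing the parameters together; the checkpoint name is illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint

batch = tokenizer(
    ["A short sentence.", "A much longer sentence that may need to be truncated."],
    padding=True,         # pad to the longest sequence in the batch
    truncation=True,      # drop tokens beyond max_length
    max_length=128,
    return_tensors="pt",  # PyTorch tensors ("tf" and "np" also accepted)
)
print(batch["input_ids"].shape)
```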
## Best Practices

| Practice | Why |
|---|---|
| Use pipelines for inference | Handles preprocessing automatically |
| Use device_map="auto" | Optimal GPU memory distribution |
| Batch inputs | Better throughput |
| Use quantization for large models | Run 7B+ models on consumer GPUs |
| Match the tokenizer to the model | Vocabularies differ between models |
| Use Trainer for fine-tuning | Built-in best practices |
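A small sketch of batched pipeline inference; passing a list plus batch_size groups inputs so the GPU processes several sequences per forward pass:

```python
from transformers import pipeline

classifier = pipeline("text-classification")

texts = ["Great library!", "This broke my build.", "Works as documented."]
# Batching improves throughput compared to one call per text
for result in classifier(texts, batch_size=8):
    print(result["label"], round(result["score"], 3))
```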
## Resources