Hugging Face Transformers - Modern AI Models

Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. It reduces compute costs and carbon footprint by allowing researchers to reuse models instead of training from scratch.

When to Use

  • Natural Language Processing (Summarization, Translation, Named Entity Recognition).
  • Scientific Sequence Analysis (Protein folding, DNA/RNA sequence modeling).
  • Chemical Property Prediction (Using molecular strings like SMILES).
  • Computer Vision (Vision Transformers - ViT, Image Classification).
  • Time Series Forecasting with foundation models.
  • Fine-tuning Large Language Models (LLMs) on domain-specific scientific literature.
  • Multimodal tasks (Document AI, Visual Question Answering).

Reference Documentation

Official docs: https://huggingface.co/docs/transformers/
Model Hub: https://huggingface.co/models
Search patterns: pipeline, AutoModel, AutoTokenizer, Trainer, PEFT (Parameter-Efficient Fine-Tuning)

Core Principles

The "Auto" Classes

Hugging Face uses "Auto" classes (AutoModel, AutoTokenizer) that automatically infer the correct architecture from the model name/path. This makes code highly portable.
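
For example, the same call can load two very different architectures; the checkpoint name alone determines what comes back (both models also appear later in this document):

from transformers import AutoModel

# Same code path, different architectures, inferred from each checkpoint's config
bert = AutoModel.from_pretrained("bert-base-uncased")         # a BERT text encoder
esm = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")  # an ESM-2 protein model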

Tokenization

Before data enters a model, it must be converted into numerical tokens. The Tokenizer handles this, including padding, truncation, and special tokens (like [CLS], [SEP]).
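
A quick illustration with bert-base-uncased, whose tokenizer adds [CLS]/[SEP] automatically and pads with [PAD]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Science is cool.", padding="max_length", max_length=8, truncation=True)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'science', 'is', 'cool', '.', '[SEP]', '[PAD]', '[PAD]']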

Pipelines

The simplest way to use a model. It abstracts away tokenization, model execution, and post-processing into a single pipe(data) call.

Quick Reference

Installation

pip install transformers datasets tokenizers
# Requires a deep learning backend, typically PyTorch
pip install torch

Standard Imports

from transformers import pipeline, AutoModel, AutoTokenizer, TrainingArguments, Trainer
import torch

Basic Pattern - Using a Pretrained Pipeline

from transformers import pipeline

# 1. Initialize a pipeline (automatically downloads model)
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# 2. Run inference
results = classifier("The molecular structure of this compound is fascinating.")
print(results)

Critical Rules

✅ DO

  • Use the Auto Classes - Always prefer AutoTokenizer.from_pretrained() and AutoModel.from_pretrained() for flexibility.
  • Set the Device - Explicitly set device=0 (for CUDA) or device="mps" (for Apple Silicon) in pipelines to ensure GPU acceleration; see the sketch after this list.
  • Cache Models - Models are large. Use the HF_HOME environment variable to manage where models are stored on disk.
  • Handle Truncation - Most models have a maximum sequence length (usually 512). Always use truncation=True in tokenizers.
  • Use Datasets Library - For training, use the datasets library to handle data loading and streaming without filling RAM.
  • Save Tokenizers with Models - When fine-tuning, always save the tokenizer alongside the model to ensure consistency.
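
A minimal sketch combining several of these rules. The HF_HOME path is illustrative, and the variable must be set before transformers is imported:

import os
os.environ["HF_HOME"] = "/scratch/hf_cache"  # hypothetical cache location

from transformers import pipeline

# device=0 targets the first CUDA GPU; use device="mps" on Apple Silicon
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
)

# truncation=True guards against inputs longer than the model's limit
print(clf("A very long abstract ...", truncation=True))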

❌ DON'T

  • Load Models in a Loop - Loading a model takes seconds and GBs of RAM. Load once, reuse many times.
  • Upload Private Data - Be careful with hosted inference endpoints or APIs that send data off-machine; transformers itself runs models locally.
  • Ignore Padding - For batch processing, ensure padding=True so all sequences in the batch have the same length.
  • Use Wrong Model for Task - A "BERT" model is for understanding; "GPT" is for generation. Use the right architecture; see the sketch below.
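
A sketch of matching the head class to the task (the num_labels value here is illustrative):

from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

# Encoder with a classification head: "understanding" tasks (sentiment, NER, ...)
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Decoder-only model: text generation
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")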

Anti-Patterns (NEVER)

from transformers import AutoModel, AutoTokenizer

# ❌ BAD: Re-initializing the model inside a function called frequently
def get_prediction(text):
    model = AutoModel.from_pretrained("bert-base-uncased")  # ❌ SLOW & RAM HEAVY
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    return model(**tokenizer(text, return_tensors="pt"))

# ✅ GOOD: Load once globally (or in a class) and reuse
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def get_prediction(text):
    inputs = tokenizer(text, return_tensors="pt")
    return model(**inputs)

# ❌ BAD: Manual string splitting for "tokens"
# tokens = text.split(" ") # ❌ Not compatible with model vocabulary

# ✅ GOOD: Use the model's specific tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(text, return_tensors="pt")

# ❌ BAD: Forgetting to move the model to the GPU
# model = AutoModel.from_pretrained("...")
# output = model(inputs.to("cuda")) # ❌ Error: inputs are on GPU, model is on CPU!

# ✅ GOOD: Move the model to the same device as the inputs
# model = AutoModel.from_pretrained("...").to("cuda")

Tokenization Deep Dive

Preparing Data for Models

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["Science is cool.", "Quantum physics is hard."]

# Batch encoding
inputs = tokenizer(
    texts, 
    padding=True, 
    truncation=True, 
    max_length=128, 
    return_tensors="pt" # Returns PyTorch tensors
)

print(inputs['input_ids'].shape) # (batch_size, seq_len)

The Trainer API

Simplified Training Loop

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)

# trainer.train()
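
Trainer also accepts an optional compute_metrics callable that runs during evaluation; a minimal accuracy sketch:

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair of NumPy arrays
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Pass it in: Trainer(..., compute_metrics=compute_metrics)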

Scientific Applications

1. Protein Sequence Analysis (ESM)

# ESM-2 is a powerful protein language model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

protein_seq = "MAPLRKTYLLG"
inputs = tokenizer(protein_seq, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    # The 'last_hidden_state' represents a "feature vector" for each amino acid
    embeddings = outputs.last_hidden_state
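
To get one vector per sequence rather than per residue, a common recipe is masked mean-pooling over the sequence dimension:

# Mean-pool per-residue embeddings into one sequence-level vector,
# using the attention mask so padding positions don't contribute
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
seq_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(seq_embedding.shape)                             # (batch, hidden_size)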

2. Chemical Property Prediction (SMILES)

# ChemBERTa is pretrained on SMILES strings (ZINC database). Note: this is a base
# checkpoint, so fine-tune it on labeled property data before trusting predictions.
pipe = pipeline("text-classification", model="seyonec/ChemBERTa-zinc-base-v1")

smiles = "CC(=O)Oc1ccccc1C(=O)O" # Aspirin
result = pipe(smiles)
print(f"Prediction: {result}")

3. Named Entity Recognition (NER) for Papers

# Extracting entities from text. dslim/bert-base-NER is general-purpose
# (PER/ORG/LOC/MISC); for genes, proteins, or chemicals, prefer a
# biomedical NER checkpoint from the Hub.
ner_pipe = pipeline("ner", model="dslim/bert-base-NER")
text = "The expression of the BRCA1 gene was observed in the sample."
entities = ner_pipe(text)
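
By default the pipeline returns one entry per sub-word token; passing aggregation_strategy="simple" merges the pieces into whole entity spans:

# Merge sub-word tokens into complete entities
ner_pipe = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
for ent in ner_pipe(text):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))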

Performance and Efficiency

1. Quantization (bitsandbytes)

Running large models on consumer GPUs by reducing precision (8-bit or 4-bit).

from transformers import BitsAndBytesConfig

# Load the model in 4-bit precision (needs the bitsandbytes package and a CUDA GPU)
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModel.from_pretrained(
    "model_name",  # replace with a real checkpoint ID
    quantization_config=quant_config,
    device_map="auto",  # needs the accelerate package
)
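
Quantization and automatic device mapping rely on extra packages; a likely install line:

pip install bitsandbytes accelerate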

2. Using pipeline with GPU

# 'device=0' targets the first CUDA device
pipe = pipeline("translation_en_to_fr", model="t5-base", device=0)

Common Pitfalls and Solutions

"Out of Memory" (OOM) on GPU

# ❌ Problem: Batch size is too large for GPU RAM
# ✅ Solution: 
# 1. Reduce 'per_device_train_batch_size'
# 2. Use 'gradient_accumulation_steps' to keep effective batch size
# 3. Use 'fp16=True' in TrainingArguments
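
Expressed in TrainingArguments, a sketch with illustrative values:

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,   # smaller per-step batch
    gradient_accumulation_steps=4,   # effective batch size stays 4 * 4 = 16
    fp16=True,                       # mixed precision on CUDA GPUs
)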

Model Output is a Dictionary, not a Tensor

# ❌ Problem: outputs[0] works, but is confusing
# ✅ Solution: Access by name
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state

Slow Tokenization

# ✅ Solution: Use "Fast" tokenizers (written in Rust, usually default)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

Hugging Face Transformers has democratized AI for the scientific community. By providing a unified interface to the world's most powerful models, it allows researchers to spend less time on engineering and more time on discovering insights from data.
