| name | description |
|---|---|
| transformers | Essential toolkit for the Hugging Face Transformers library enabling state-of-the-art machine learning across natural language processing, computer vision, audio processing, and multimodal applications. Use this skill for: loading and using pretrained transformer models (BERT, GPT, T5, RoBERTa, DistilBERT, BART, ViT, CLIP, Whisper, Llama, Mistral), implementing text generation and completion, fine-tuning models for custom tasks, text classification and sentiment analysis, question answering and reading comprehension, named entity recognition and token classification, text summarization and translation, image classification and object detection, speech recognition and audio processing, multimodal tasks combining text and images, parameter-efficient fine-tuning with LoRA and adapters, model quantization and optimization, training custom transformer models, implementing chat interfaces and conversational AI, working with tokenizers and text preprocessing, handling model inference and deployment, managing GPU memory and device allocation, implementing custom training loops, using pipelines for quick inference, working with the Hugging Face Hub for model sharing, and any machine learning task involving transformer architectures or attention mechanisms. |
Transformers
Overview
Transformers is Hugging Face's flagship library providing unified access to over 1 million pretrained models for machine learning across text, vision, audio, and multimodal domains. The library serves as a standardized model-definition framework compatible with PyTorch, TensorFlow, and JAX, emphasizing ease of use through three core components:
- Pipeline: Simple, optimized inference API for common tasks
- AutoClasses: Automatic model/tokenizer selection from pretrained checkpoints
- Trainer: Full-featured training loop with distributed training, mixed precision, and optimization
The library prioritizes accessibility: pretrained models reduce computational costs and carbon footprint, and the standardized model definitions remain compatible with the wider training and inference ecosystem (PyTorch Lightning, DeepSpeed, vLLM, etc.).
Quick Start with Pipelines
Use pipelines for simple, efficient inference without managing models, tokenizers, or preprocessing manually. Pipelines abstract complexity into a single function call.
Basic Pipeline Usage
from transformers import pipeline
# Text classification
classifier = pipeline("text-classification")
result = classifier("This restaurant is awesome")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Text generation
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
generator("The secret to baking a good cake is", max_length=50)
# Question answering
qa = pipeline("question-answering")
qa(question="What is extractive QA?", context="Extractive QA is...")
# Image classification
img_classifier = pipeline("image-classification")
img_classifier("path/to/image.jpg")
# Automatic speech recognition
transcriber = pipeline("automatic-speech-recognition")
transcriber("audio_file.mp3")
Available Pipeline Tasks
NLP Tasks:
text-classification, token-classification, question-answering, fill-mask, summarization, translation, text-generation, conversational, zero-shot-classification, sentiment-analysis
Vision Tasks:
image-classification, image-segmentation, object-detection, depth-estimation, image-to-image, zero-shot-image-classification
Audio Tasks:
automatic-speech-recognition, audio-classification, text-to-audio, zero-shot-audio-classification
Multimodal Tasks:
visual-question-answering, document-question-answering, image-to-text, zero-shot-object-detection
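All of these tasks share the same call pattern. A minimal sketch for zero-shot classification, letting the pipeline download its default checkpoint (the candidate labels below are only illustrative):
from transformers import pipeline
# Zero-shot classification: score arbitrary labels without fine-tuning
classifier = pipeline("zero-shot-classification")
result = classifier(
    "The new GPU drivers cut inference latency in half",
    candidate_labels=["technology", "sports", "politics"],  # illustrative labels
)
# {'sequence': ..., 'labels': ['technology', ...], 'scores': [...]}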
Pipeline Best Practices
Device Management:
from transformers import pipeline, infer_device
device = infer_device() # Auto-detect best device
pipe = pipeline("text-generation", model="...", device=device)
Batch Processing:
# Process multiple inputs efficiently
results = classifier(["Text 1", "Text 2", "Text 3"])
# Use KeyDataset for large datasets
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset
dataset = load_dataset("imdb", split="test")
for result in pipe(KeyDataset(dataset, "text")):
print(result)
Memory Optimization:
# Use half-precision for faster inference
import torch
pipe = pipeline("text-generation", model="...",
                torch_dtype=torch.float16, device="cuda")
Core Components
AutoClasses for Model Loading
AutoClasses automatically select the correct architecture based on pretrained checkpoints.
from transformers import (
AutoModel, AutoTokenizer, AutoConfig,
AutoModelForCausalLM, AutoModelForSequenceClassification
)
# Load any model by checkpoint name
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Task-specific model classes
causal_lm = AutoModelForCausalLM.from_pretrained("gpt2")
classifier = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=3
)
# Load with device and dtype optimization
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
device_map="auto", # Automatically distribute across devices
torch_dtype="auto" # Use optimal dtype
)
Key Parameters:
- device_map="auto": Optimal device allocation (CPU/GPU/multi-GPU)
- torch_dtype: Control precision (torch.float16, torch.bfloat16, "auto")
- trust_remote_code: Enable custom model code (use cautiously)
- use_fast: Enable Rust-backed fast tokenizers (default True)
Tokenization
Tokenizers convert text to model-compatible tensor inputs.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Basic tokenization
tokens = tokenizer.tokenize("Hello, how are you?")
# ['hello', ',', 'how', 'are', 'you', '?']
# Encoding (text → token IDs)
encoded = tokenizer("Hello, how are you?", return_tensors="pt")
# {'input_ids': tensor([[...]]), 'attention_mask': tensor([[...]])}
# Batch encoding with padding and truncation
batch = tokenizer(
["Short text", "This is a much longer text..."],
padding=True, # Pad to longest in batch
truncation=True, # Truncate to model's max length
max_length=512,
return_tensors="pt"
)
# Decoding (token IDs → text)
text = tokenizer.decode(encoded['input_ids'][0])
Special Tokens:
# Access special tokens
tokenizer.pad_token # Padding token
tokenizer.cls_token # Classification token
tokenizer.sep_token # Separator token
tokenizer.mask_token # Mask token (for MLM)
# Add custom tokens
tokenizer.add_tokens(["[CUSTOM]"])
tokenizer.add_special_tokens({'additional_special_tokens': ['[NEW]']})
# Resize model embeddings to match new vocabulary
model.resize_token_embeddings(len(tokenizer))
Image Processors
For vision tasks, use image processors instead of tokenizers.
from transformers import AutoImageProcessor
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
# Process single image
from PIL import Image
image = Image.open("path/to/image.jpg")
inputs = processor(image, return_tensors="pt")
# Returns: {'pixel_values': tensor([[...]])}
# Batch processing
images = [Image.open(f"img{i}.jpg") for i in range(3)]
inputs = processor(images, return_tensors="pt")
Processors for Multimodal Models
Multimodal models use processors that combine image and text processing.
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("microsoft/git-base")
# Process image + text caption
inputs = processor(
images=image,
text="A description of the image",
return_tensors="pt",
padding=True
)
Model Inference
Basic Inference Pattern
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Tokenize input
inputs = tokenizer("The future of AI is", return_tensors="pt")
# Generate (for causal LM)
outputs = model.generate(**inputs, max_length=50)
text = tokenizer.decode(outputs[0])
# Or get model outputs directly
outputs = model(**inputs)
logits = outputs.logits # Shape: (batch_size, seq_len, vocab_size)
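The logits can also be inspected directly. A small sketch reading the single most likely next token, reusing the model, tokenizer, and inputs from the snippet above:
import torch
with torch.no_grad():
    outputs = model(**inputs)
# The logits at the final position score the token that would follow the prompt
next_token_id = int(outputs.logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))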
Text Generation Strategies
For generative models, control generation behavior with parameters:
# Greedy decoding (default)
output = model.generate(inputs, max_length=50)
# Beam search (multiple hypotheses)
output = model.generate(
inputs,
max_length=50,
num_beams=5, # Keep top 5 beams
early_stopping=True
)
# Sampling with temperature
output = model.generate(
inputs,
max_length=50,
do_sample=True,
temperature=0.7, # Lower = more focused, higher = more random
top_k=50, # Sample from top 50 tokens
top_p=0.95 # Nucleus sampling
)
# Streaming generation
from transformers import TextStreamer
streamer = TextStreamer(tokenizer)
model.generate(**inputs, streamer=streamer, max_length=100)
Generation Parameters:
- max_length / max_new_tokens: Control output length
- num_beams: Beam search width (1 = greedy)
- temperature: Randomness (0.7-1.0 typical)
- top_k: Sample from top k tokens
- top_p: Nucleus sampling threshold
- repetition_penalty: Discourage repetition (>1.0)
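These settings can also be bundled into a reusable GenerationConfig instead of being passed ad hoc; a minimal sketch (the specific values are only illustrative):
from transformers import GenerationConfig
gen_config = GenerationConfig(
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.1,
)
output = model.generate(**inputs, generation_config=gen_config)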
Refer to references/generation_strategies.md for detailed information on choosing appropriate strategies.
Training and Fine-Tuning
Training Workflow Overview
1. Load dataset → 2. Preprocess → 3. Configure training → 4. Train → 5. Evaluate → 6. Save/Share
Text Classification Example
from transformers import (
AutoTokenizer, AutoModelForSequenceClassification,
TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import load_dataset
# 1. Load dataset
dataset = load_dataset("imdb")
# 2. Preprocess
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def preprocess(examples):
return tokenizer(examples["text"], truncation=True)
tokenized = dataset.map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# 3. Load model
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2,
id2label={0: "negative", 1: "positive"},
label2id={"negative": 0, "positive": 1}
)
# 4. Configure training
training_args = TrainingArguments(
output_dir="./results",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=False,
)
# 5. Train
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
tokenizer=tokenizer,
data_collator=data_collator,
)
trainer.train()
# 6. Evaluate and save
metrics = trainer.evaluate()
trainer.save_model("./my-finetuned-model")
trainer.push_to_hub() # Share to Hugging Face Hub
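Once saved, the fine-tuned checkpoint can be served back through a pipeline for inference; a quick sketch assuming the local path used above:
from transformers import pipeline
clf = pipeline("text-classification", model="./my-finetuned-model")
clf("The plot was predictable but the acting saved it")
# [{'label': ..., 'score': ...}]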
Vision Task Fine-Tuning
from transformers import (
AutoImageProcessor, AutoModelForImageClassification,
TrainingArguments, Trainer
)
from datasets import load_dataset
# Load dataset
dataset = load_dataset("food101", split="train[:5000]")
# Image preprocessing
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
def transform(examples):
examples["pixel_values"] = [
processor(img.convert("RGB"), return_tensors="pt")["pixel_values"][0]
for img in examples["image"]
]
return examples
dataset = dataset.with_transform(transform)
# Load model
model = AutoModelForImageClassification.from_pretrained(
"google/vit-base-patch16-224",
num_labels=101, # 101 food categories
ignore_mismatched_sizes=True
)
# Training (similar pattern to text)
training_args = TrainingArguments(
output_dir="./vit-food101",
remove_unused_columns=False, # Keep image data
eval_strategy="epoch",
save_strategy="epoch",
learning_rate=5e-5,
per_device_train_batch_size=32,
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=processor,
)
trainer.train()
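Depending on library versions, the default collator may not stack the transformed images and integer labels correctly; a minimal collate function you can pass as data_collator to the Trainer above (assumes the dataset keeps its label column named "label"):
import torch

def collate_fn(batch):
    # Stack preprocessed images and gather integer class labels
    return {
        "pixel_values": torch.stack([example["pixel_values"] for example in batch]),
        "labels": torch.tensor([example["label"] for example in batch]),
    }

# Pass data_collator=collate_fn when constructing the Trainer above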
Sequence-to-Sequence Tasks
For tasks like summarization and translation, use Seq2SeqTrainer:
from transformers import (
AutoTokenizer, AutoModelForSeq2SeqLM,
Seq2SeqTrainingArguments, Seq2SeqTrainer,
DataCollatorForSeq2Seq
)
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
def preprocess(examples):
# Prefix input for T5
inputs = ["summarize: " + doc for doc in examples["text"]]
model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
# Tokenize targets
labels = tokenizer(
examples["summary"],
max_length=128,
truncation=True
)
model_inputs["labels"] = labels["input_ids"]
return model_inputs
tokenized_dataset = dataset.map(preprocess, batched=True)
training_args = Seq2SeqTrainingArguments(
output_dir="./t5-summarization",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=8,
num_train_epochs=3,
predict_with_generate=True, # Important for seq2seq
)
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["test"],
tokenizer=tokenizer,
data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
)
trainer.train()
Important TrainingArguments
TrainingArguments(
# Essential
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=8,
learning_rate=2e-5,
# Evaluation
eval_strategy="epoch", # or "steps"
eval_steps=500, # if eval_strategy="steps"
# Checkpointing
save_strategy="epoch",
save_steps=500,
save_total_limit=2, # Keep only the 2 most recent checkpoints
load_best_model_at_end=True,
metric_for_best_model="accuracy",
# Optimization
gradient_accumulation_steps=4,
warmup_steps=500,
weight_decay=0.01,
max_grad_norm=1.0,
# Mixed Precision
fp16=True, # For older Nvidia GPUs (enable either fp16 or bf16, not both)
# bf16=True, # Preferred on Ampere+ GPUs
# Logging
logging_steps=100,
report_to="tensorboard", # or "wandb", "mlflow"
# Memory Optimization
gradient_checkpointing=True,
optim="adamw_torch", # or "adafactor" for memory
# Distributed Training
ddp_find_unused_parameters=False,
)
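When metric_for_best_model is set, the Trainer needs a compute_metrics function to produce that metric; a minimal sketch, assuming the evaluate library is installed:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a tuple of (logits, labels)
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# trainer = Trainer(..., compute_metrics=compute_metrics)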
Refer to references/training_guide.md for comprehensive training patterns and optimization strategies.
Performance Optimization
Model Quantization
Reduce memory footprint while maintaining accuracy:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=BitsAndBytesConfig(load_in_8bit=True),
device_map="auto"
)
# 4-bit quantization (even smaller)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto"
)
Quantization Methods:
- Bitsandbytes: 4/8-bit on-the-fly quantization, supports PEFT fine-tuning
- GPTQ: 2/3/4/8-bit, requires calibration, very fast inference
- AWQ: 4-bit activation-aware, balanced speed/accuracy
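Pre-quantized GPTQ and AWQ checkpoints from the Hub typically load through the same from_pretrained call, since the quantization config ships with the model. A sketch, assuming the optimum/auto-gptq (or autoawq) backend is installed; the repository name is only an example:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example repo name -- substitute any GPTQ- or AWQ-quantized checkpoint
model_id = "TheBloke/Llama-2-7B-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)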
Refer to references/quantization.md for detailed comparison and usage patterns.
Training Optimization
# Gradient accumulation (simulate larger batch)
training_args = TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=8, # Effective batch = 4 * 8 = 32
)
# Gradient checkpointing (reduce memory, slower)
training_args = TrainingArguments(
gradient_checkpointing=True,
)
# Mixed precision training
training_args = TrainingArguments(
bf16=True, # or fp16=True
)
# Efficient optimizer
training_args = TrainingArguments(
optim="adafactor", # Lower memory than AdamW
)
Key Strategies:
- Batch sizes: Use powers of 2 (8, 16, 32, 64, 128)
- Gradient accumulation: Enables larger effective batch sizes
- Gradient checkpointing: Reduces memory ~60%, increases time ~20%
- Mixed precision: bf16 for Ampere+ GPUs, fp16 for older
- torch.compile: Optimize model graph (PyTorch 2.0+)
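For the torch.compile strategy mentioned above, a minimal sketch (requires PyTorch 2.0+; recent transformers versions also expose a torch_compile flag on TrainingArguments):
import torch
from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# Compile the forward pass for inference or a custom training loop
compiled_model = torch.compile(model)

# Or let the Trainer handle compilation (recent versions)
training_args = TrainingArguments(output_dir="./results", torch_compile=True)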
Advanced Features
Custom Training Loop
For maximum control, bypass Trainer:
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler
# Prepare data and training settings
train_dataloader = DataLoader(tokenized_dataset, batch_size=8, shuffle=True)
num_epochs = 3
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# Setup optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_scheduler(
"linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=len(train_dataloader) * num_epochs
)
# Training loop
model.train()
for epoch in range(num_epochs):
for batch in train_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
Parameter-Efficient Fine-Tuning (PEFT)
Use PEFT library with transformers for efficient fine-tuning:
from peft import LoraConfig, get_peft_model
# Configure LoRA
lora_config = LoraConfig(
r=16, # Low-rank dimension
lora_alpha=32,
target_modules=["q_proj", "v_proj"], # Which layers to adapt
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Apply to model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(model, lora_config)
# Now train as usual - only LoRA parameters train
trainer = Trainer(model=model, ...)
trainer.train()
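To check how small the trainable subset is, and to fold the adapters back into the base weights after training, PEFT models provide helpers such as these (the output directory name is just an example):
# Report trainable vs. total parameter counts (typically <1% trainable with LoRA)
model.print_trainable_parameters()

# After training, merge the LoRA weights into the base model for deployment
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama2-lora-merged")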
Chat Templates
Apply chat templates for instruction-tuned models:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", device_map="auto")
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"},
]
# Format according to model's chat template
formatted = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
# Tokenize and generate
inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=200)
response = tokenizer.decode(outputs[0])
Multi-GPU Training
# Automatic with Trainer - no code changes needed
# Just run with: accelerate launch train.py
# Or use PyTorch DDP explicitly
training_args = TrainingArguments(
output_dir="./results",
ddp_find_unused_parameters=False,
# ... other args
)
# For larger models, use FSDP
training_args = TrainingArguments(
output_dir="./results",
fsdp="full_shard auto_wrap",
fsdp_config={
"fsdp_transformer_layer_cls_to_wrap": ["BertLayer"],
},
)
Task-Specific Patterns
Question Answering (Extractive)
from transformers import pipeline
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
question="What is extractive QA?",
context="Extractive QA extracts the answer from the given context..."
)
# {'answer': 'extracts the answer from the given context', 'score': 0.97, ...}
Named Entity Recognition
ner = pipeline("token-classification", model="dslim/bert-base-NER")
result = ner("My name is John and I live in New York")
# [{'entity': 'B-PER', 'word': 'John', ...}, {'entity': 'B-LOC', 'word': 'New York', ...}]
Image Captioning
from transformers import AutoProcessor, AutoModelForCausalLM
processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")
from PIL import Image
image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0]
Speech Recognition
transcriber = pipeline(
"automatic-speech-recognition",
model="openai/whisper-base"
)
result = transcriber("audio.mp3")
# {'text': 'This is the transcribed text...'}
# With timestamps
result = transcriber("audio.mp3", return_timestamps=True)
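For long recordings, the ASR pipeline can transcribe in fixed-size chunks; a sketch using the chunk_length_s argument (30 seconds is a common window for Whisper, and the filename is illustrative):
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    chunk_length_s=30,  # split long audio into 30-second windows
)
result = transcriber("long_audio.mp3", return_timestamps=True)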
Common Patterns and Best Practices
Saving and Loading Models
# Save entire model
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")
# Load later
model = AutoModel.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")
# Push to Hugging Face Hub
model.push_to_hub("username/my-model")
tokenizer.push_to_hub("username/my-model")
# Load from Hub
model = AutoModel.from_pretrained("username/my-model")
Error Handling
from transformers import AutoModel
import torch
try:
model = AutoModel.from_pretrained("model-name")
except OSError:
print("Model not found - check internet connection or model name")
except torch.cuda.OutOfMemoryError:
print("GPU memory exceeded - try quantization or smaller batch size")
Device Management
import torch
# Check device availability
device = "cuda" if torch.cuda.is_available() else "cpu"
# Move model to device
model = model.to(device)
# Or use device_map for automatic distribution
model = AutoModel.from_pretrained("model-name", device_map="auto")
# For inputs
inputs = tokenizer(text, return_tensors="pt").to(device)
Memory Management
import torch
# Clear CUDA cache
torch.cuda.empty_cache()
# Use context manager for inference
with torch.no_grad():
outputs = model(**inputs)
# Delete unused models
del model
torch.cuda.empty_cache()
Resources
This skill includes comprehensive reference documentation and example scripts:
scripts/
- quick_inference.py: Ready-to-use script for running inference with pipelines
- fine_tune_classifier.py: Complete example for fine-tuning a text classifier
- generate_text.py: Text generation with various strategies
Execute scripts directly or read them as implementation templates.
references/
- api_reference.md: Comprehensive API documentation for key classes
- training_guide.md: Detailed training patterns, optimization, and troubleshooting
- generation_strategies.md: In-depth guide to text generation methods
- quantization.md: Model quantization techniques comparison and usage
- task_patterns.md: Quick reference for common task implementations
Load reference files when you need detailed information on specific topics. References contain extensive examples, parameter explanations, and best practices.
Troubleshooting
Import errors:
pip install transformers
pip install accelerate # For device_map="auto"
pip install bitsandbytes # For quantization
CUDA out of memory:
- Reduce batch size
- Enable gradient checkpointing
- Use gradient accumulation
- Try quantization (8-bit or 4-bit)
- Use smaller model variant
Slow training:
- Enable mixed precision (fp16/bf16)
- Increase batch size (if memory allows)
- Use torch.compile (PyTorch 2.0+)
- Check that data loading isn't the bottleneck (e.g., raise dataloader_num_workers)
Poor generation quality:
- Adjust temperature (lower = more focused)
- Try different decoding strategies (beam search vs sampling)
- Increase max_length if outputs cut off
- Use repetition_penalty to reduce repetition
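A sketch combining these fixes, assuming a model and tokenized inputs as in the generation examples above (the values are starting points, not tuned settings):
output = model.generate(
    **inputs,
    max_new_tokens=200,      # raise if outputs are cut off
    do_sample=True,
    temperature=0.7,         # lower for more focused text
    top_p=0.9,
    repetition_penalty=1.2,  # >1.0 discourages repetition loops
)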
For task-specific guidance, consult the appropriate reference file in the references/ directory.