<!-- mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git, synced 2026-03-28 07:33:45 +08:00 -->
---
name: transformers
description: "Hugging Face Transformers. Load BERT, GPT, T5, ViT, CLIP, Llama models, fine-tune, text generation, classification, NER, pipelines, LoRA, for NLP/vision/audio tasks."
---

# Transformers

## Overview

Transformers is Hugging Face's flagship library, providing unified access to over 1 million pretrained models for machine learning across text, vision, audio, and multimodal domains. The library serves as a standardized model-definition framework compatible with PyTorch, TensorFlow, and JAX, emphasizing ease of use through three core components:

- **Pipeline**: Simple, optimized inference API for common tasks
- **AutoClasses**: Automatic model/tokenizer selection from pretrained checkpoints
- **Trainer**: Full-featured training loop with distributed training, mixed precision, and optimization

The library prioritizes accessibility: pretrained models reduce computational cost and carbon footprint, and compatibility spans the major training frameworks (PyTorch Lightning, DeepSpeed, vLLM, etc.).
## Quick Start with Pipelines

Use pipelines for simple, efficient inference without managing models, tokenizers, or preprocessing manually. Pipelines abstract that complexity into a single function call.

### Basic Pipeline Usage

```python
from transformers import pipeline

# Text classification
classifier = pipeline("text-classification")
result = classifier("This restaurant is awesome")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
generator("The secret to baking a good cake is", max_length=50)

# Question answering
qa = pipeline("question-answering")
qa(question="What is extractive QA?", context="Extractive QA is...")

# Image classification
img_classifier = pipeline("image-classification")
img_classifier("path/to/image.jpg")

# Automatic speech recognition
transcriber = pipeline("automatic-speech-recognition")
transcriber("audio_file.mp3")
```

### Available Pipeline Tasks

**NLP Tasks:**
- `text-classification`, `token-classification`, `question-answering`
- `fill-mask`, `summarization`, `translation`
- `text-generation`, `conversational`
- `zero-shot-classification`, `sentiment-analysis`

**Vision Tasks:**
- `image-classification`, `image-segmentation`, `object-detection`
- `depth-estimation`, `image-to-image`, `zero-shot-image-classification`

**Audio Tasks:**
- `automatic-speech-recognition`, `audio-classification`
- `text-to-audio`, `zero-shot-audio-classification`

**Multimodal Tasks:**
- `visual-question-answering`, `document-question-answering`
- `image-to-text`, `zero-shot-object-detection`

### Pipeline Best Practices

**Device Management:**
```python
from transformers import pipeline, infer_device

device = infer_device()  # Auto-detect best device
pipe = pipeline("text-generation", model="...", device=device)
```

**Batch Processing:**
```python
# Process multiple inputs efficiently
results = classifier(["Text 1", "Text 2", "Text 3"])

# Use KeyDataset for large datasets
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset

dataset = load_dataset("imdb", split="test")
for result in pipe(KeyDataset(dataset, "text")):
    print(result)
```

**Memory Optimization:**
```python
import torch

# Use half-precision for faster inference
pipe = pipeline(
    "text-generation", model="...",
    torch_dtype=torch.float16, device="cuda"
)
```

## Core Components

### AutoClasses for Model Loading

AutoClasses automatically select the correct architecture based on pretrained checkpoints.

```python
from transformers import (
    AutoModel, AutoTokenizer, AutoConfig,
    AutoModelForCausalLM, AutoModelForSequenceClassification
)

# Load any model by checkpoint name
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Task-specific model classes
causal_lm = AutoModelForCausalLM.from_pretrained("gpt2")
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)

# Load with device and dtype optimization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",   # Automatically distribute across devices
    torch_dtype="auto"   # Use optimal dtype
)
```

**Key Parameters:**
- `device_map="auto"`: Optimal device allocation (CPU/GPU/multi-GPU)
- `torch_dtype`: Control precision (`torch.float16`, `torch.bfloat16`, `"auto"`)
- `trust_remote_code`: Enable custom model code (use cautiously)
- `use_fast`: Enable Rust-backed fast tokenizers (default `True`)

### Tokenization

Tokenizers convert text to model-compatible tensor inputs.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization
tokens = tokenizer.tokenize("Hello, how are you?")
# ['hello', ',', 'how', 'are', 'you', '?']

# Encoding (text → token IDs)
encoded = tokenizer("Hello, how are you?", return_tensors="pt")
# {'input_ids': tensor([[...]]), 'attention_mask': tensor([[...]])}

# Batch encoding with padding and truncation
batch = tokenizer(
    ["Short text", "This is a much longer text..."],
    padding=True,     # Pad to longest in batch
    truncation=True,  # Truncate to model's max length
    max_length=512,
    return_tensors="pt"
)

# Decoding (token IDs → text)
text = tokenizer.decode(encoded['input_ids'][0])
```
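
What `padding=True` produces can be sketched in plain Python. `pad_batch` is a hypothetical helper, not part of the library; real tokenizers also handle truncation, special tokens, and tensor conversion:

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token-id lists to the longest sequence,
    mirroring the input_ids/attention_mask pair the tokenizer returns."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 1037, 102]])
# batch["input_ids"][0] == [101, 7592, 102, 0, 0]
```

The zeros in `attention_mask` tell the model which positions are padding to be ignored.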

**Special Tokens:**
```python
# Access special tokens
tokenizer.pad_token   # Padding token
tokenizer.cls_token   # Classification token
tokenizer.sep_token   # Separator token
tokenizer.mask_token  # Mask token (for MLM)

# Add custom tokens
tokenizer.add_tokens(["[CUSTOM]"])
tokenizer.add_special_tokens({'additional_special_tokens': ['[NEW]']})

# Resize model embeddings to match the new vocabulary
model.resize_token_embeddings(len(tokenizer))
```

### Image Processors

For vision tasks, use image processors instead of tokenizers.

```python
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Process a single image
image = Image.open("path/to/image.jpg")
inputs = processor(image, return_tensors="pt")
# Returns: {'pixel_values': tensor([[...]])}

# Batch processing
images = [Image.open(f"img{i}.jpg") for i in range(3)]
inputs = processor(images, return_tensors="pt")
```

### Processors for Multimodal Models

Multimodal models use processors that combine image and text processing.

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/git-base")

# Process image + text caption
inputs = processor(
    images=image,
    text="A description of the image",
    return_tensors="pt",
    padding=True
)
```

## Model Inference

### Basic Inference Pattern

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize input
inputs = tokenizer("The future of AI is", return_tensors="pt")

# Generate (for causal LM)
outputs = model.generate(**inputs, max_length=50)
text = tokenizer.decode(outputs[0])

# Or get model outputs directly
outputs = model(**inputs)
logits = outputs.logits  # Shape: (batch_size, seq_len, vocab_size)
```

### Text Generation Strategies

For generative models, control generation behavior with parameters:

```python
# Greedy decoding (default)
output = model.generate(**inputs, max_length=50)

# Beam search (multiple hypotheses)
output = model.generate(
    **inputs,
    max_length=50,
    num_beams=5,       # Keep top 5 beams
    early_stopping=True
)

# Sampling with temperature
output = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    temperature=0.7,   # Lower = more focused, higher = more random
    top_k=50,          # Sample from top 50 tokens
    top_p=0.95         # Nucleus sampling
)

# Streaming generation
from transformers import TextStreamer

streamer = TextStreamer(tokenizer)
model.generate(**inputs, streamer=streamer, max_length=100)
```

**Generation Parameters:**
- `max_length` / `max_new_tokens`: Control output length
- `num_beams`: Beam search width (1 = greedy)
- `temperature`: Randomness (0.7-1.0 typical)
- `top_k`: Sample from the top k tokens
- `top_p`: Nucleus sampling threshold
- `repetition_penalty`: Discourage repetition (>1.0)

Refer to `references/generation_strategies.md` for detailed information on choosing appropriate strategies.
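
The sampling knobs above can be illustrated numerically. This toy function (pure Python, not the library's implementation) applies temperature scaling to logits, restricts the candidate pool with top-k and top-p, and then draws one token id:

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=None):
    """Toy sampler: temperature rescales logits, top_k/top_p shrink the
    candidate pool, then a token id is drawn from the surviving tokens."""
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    probs = [math.exp(s - peak) for s in scaled]  # stable softmax
    total = sum(probs)
    probs = [p / total for p in probs]
    # top-k: keep only the k most likely token ids
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    # top-p: keep the smallest prefix of those covering probability mass p
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

token = sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2, top_p=0.95)
```

Lower temperature sharpens the distribution toward the top token; tighter `top_k`/`top_p` values cut the long tail of unlikely tokens entirely.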

## Training and Fine-Tuning

### Training Workflow Overview

1. **Load dataset** → 2. **Preprocess** → 3. **Configure training** → 4. **Train** → 5. **Evaluate** → 6. **Save/Share**

### Text Classification Example

```python
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import load_dataset

# 1. Load dataset
dataset = load_dataset("imdb")

# 2. Preprocess
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized = dataset.map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 3. Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1}
)

# 4. Configure training
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

# 5. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

# 6. Evaluate and save
metrics = trainer.evaluate()
trainer.save_model("./my-finetuned-model")
trainer.push_to_hub()  # Share to the Hugging Face Hub
```
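
By default, `trainer.evaluate()` reports only the loss; pass a `compute_metrics` function to add task metrics. A minimal accuracy implementation, sketched here with plain lists (Trainer actually hands you NumPy arrays, and the `evaluate` library is the usual choice for real metrics):

```python
def compute_metrics(eval_pred):
    """Accuracy from raw logits; pass as Trainer(compute_metrics=...).
    eval_pred is a (predictions, label_ids) pair."""
    logits, labels = eval_pred
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]  # per-row argmax
    correct = sum(p == l for p, l in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

metrics = compute_metrics(([[0.1, 0.9], [0.8, 0.2]], [1, 1]))
# {'accuracy': 0.5}
```

The returned dictionary keys appear in evaluation logs and can be referenced by `metric_for_best_model`.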

### Vision Task Fine-Tuning

```python
from transformers import (
    AutoImageProcessor, AutoModelForImageClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset

# Load dataset
dataset = load_dataset("food101", split="train[:5000]")

# Image preprocessing
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

def transform(examples):
    examples["pixel_values"] = [
        processor(img.convert("RGB"), return_tensors="pt")["pixel_values"][0]
        for img in examples["image"]
    ]
    return examples

dataset = dataset.with_transform(transform)

# Load model
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=101,               # 101 food categories
    ignore_mismatched_sizes=True  # Replace the original classification head
)

# Training (same pattern as text)
training_args = TrainingArguments(
    output_dir="./vit-food101",
    remove_unused_columns=False,  # Keep image data
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor,
)

trainer.train()
```

### Sequence-to-Sequence Tasks

For tasks like summarization and translation, use Seq2SeqTrainer:

```python
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Assumes a dataset with "text" and "summary" columns
def preprocess(examples):
    # Prefix input for T5
    inputs = ["summarize: " + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize targets
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=128,
        truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-summarization",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,  # Important for seq2seq
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model),
)

trainer.train()
```
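
`DataCollatorForSeq2Seq` pads the label sequences with `-100`, the index that the cross-entropy loss ignores, so padded target positions never contribute to training. The effect can be sketched directly (illustrative code, not the library's loss function):

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss

def masked_nll(log_probs, labels):
    """Mean negative log-likelihood over the non-ignored positions only."""
    terms = [-position[label] for position, label in zip(log_probs, labels)
             if label != IGNORE_INDEX]
    return sum(terms) / len(terms)

# Two real target tokens plus one padded position
log_probs = [[math.log(0.5), math.log(0.5)],
             [math.log(0.9), math.log(0.1)],
             [math.log(0.5), math.log(0.5)]]
loss = masked_nll(log_probs, [0, 0, IGNORE_INDEX])  # padding excluded from the mean
```

Without the `-100` mask, the averaged loss would be diluted by padding positions that the model should never be penalized on.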

### Important TrainingArguments

```python
TrainingArguments(
    # Essential
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,

    # Evaluation
    eval_strategy="epoch",       # or "steps"
    eval_steps=500,              # if eval_strategy="steps"

    # Checkpointing
    save_strategy="epoch",
    save_steps=500,              # if save_strategy="steps"
    save_total_limit=2,          # Keep only the 2 most recent checkpoints
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",

    # Optimization
    gradient_accumulation_steps=4,
    warmup_steps=500,
    weight_decay=0.01,
    max_grad_norm=1.0,

    # Mixed precision (choose one, not both)
    bf16=True,                   # Ampere+ GPUs (preferred)
    # fp16=True,                 # Older NVIDIA GPUs

    # Logging
    logging_steps=100,
    report_to="tensorboard",     # or "wandb", "mlflow"

    # Memory optimization
    gradient_checkpointing=True,
    optim="adamw_torch",         # or "adafactor" for lower memory

    # Distributed training
    ddp_find_unused_parameters=False,
)
```

Refer to `references/training_guide.md` for comprehensive training patterns and optimization strategies.

## Performance Optimization

### Model Quantization

Reduce memory footprint while largely maintaining accuracy:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# 4-bit quantization (even smaller)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
```

**Quantization Methods:**
- **bitsandbytes**: 4/8-bit on-the-fly quantization, supports PEFT fine-tuning
- **GPTQ**: 2/3/4/8-bit, requires calibration data, very fast inference
- **AWQ**: 4-bit activation-aware, balanced speed/accuracy

Refer to `references/quantization.md` for detailed comparison and usage patterns.
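
The core idea behind 8-bit loading can be sketched with absmax quantization in a few lines (a toy sketch only; bitsandbytes actually quantizes per block with outlier handling):

```python
def quantize_int8(weights):
    """Absmax quantization: map floats onto [-127, 127] integers plus one scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

quantized, scale = quantize_int8([0.4, -0.9, 0.05])
approx = dequantize(quantized, scale)  # close to, but not exactly, the originals
```

Each weight now costs 1 byte instead of 2 or 4, at the price of small rounding errors that the dequantized values reveal.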

### Training Optimization

```python
# Gradient accumulation (simulate a larger batch)
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # Effective batch = 4 * 8 = 32
)

# Gradient checkpointing (less memory, slower)
training_args = TrainingArguments(
    gradient_checkpointing=True,
)

# Mixed precision training
training_args = TrainingArguments(
    bf16=True,  # or fp16=True on older GPUs
)

# Memory-efficient optimizer
training_args = TrainingArguments(
    optim="adafactor",  # Lower memory than AdamW
)
```

**Key Strategies:**
- **Batch sizes**: Use powers of 2 (8, 16, 32, 64, 128)
- **Gradient accumulation**: Enables larger effective batch sizes
- **Gradient checkpointing**: Reduces memory ~60%, increases time ~20%
- **Mixed precision**: bf16 for Ampere+ GPUs, fp16 for older
- **torch.compile**: Optimize the model graph (PyTorch 2.0+)
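
Why gradient accumulation reproduces a larger batch: summing per-sample gradients over several micro-batches before a single optimizer step yields the same averaged gradient as one large batch. A toy check with a one-parameter squared-error model (illustrative only):

```python
def grad_mse(w, x, y):
    """d/dw of (w*x - y)**2 for one sample."""
    return 2 * (w * x - y) * x

samples = [(1.0, 2.0), (2.0, 3.0), (0.5, 1.0), (1.5, 2.5)]
w = 0.1

# One large batch: average the gradient over all samples at once
full_batch = sum(grad_mse(w, x, y) for x, y in samples) / len(samples)

# Accumulation: two micro-batches of two, gradients summed, then averaged
accumulated = 0.0
for micro_batch in (samples[:2], samples[2:]):
    accumulated += sum(grad_mse(w, x, y) for x, y in micro_batch)
accumulated /= len(samples)
# full_batch and accumulated agree, so the optimizer sees the same update
```

The only caveat in practice is that batch-dependent layers (e.g. batch normalization) see the micro-batch statistics, not the full-batch ones.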

## Advanced Features

### Custom Training Loop

For maximum control, bypass Trainer (assumes `model` and `tokenized_dataset` from the previous sections):

```python
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
num_epochs = 3

# Prepare data
train_dataloader = DataLoader(tokenized_dataset, batch_size=8, shuffle=True)

# Set up optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=len(train_dataloader) * num_epochs
)

# Training loop
model.to(device)
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```
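
The `"linear"` schedule above ramps the learning rate up from zero during warmup and then decays it linearly back to zero. Its shape can be reproduced in a few lines (a sketch of the formula, not the library code):

```python
def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step for linear warmup + linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)                      # ramp up
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)       # decay to zero

lrs = [linear_lr(s, base_lr=5e-5, warmup_steps=2, total_steps=10) for s in range(11)]
# starts at 0, peaks at base_lr right after warmup, ends at 0
```

Warmup avoids large, noisy updates while the optimizer state is still cold; the decay shrinks step sizes as training converges.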

### Parameter-Efficient Fine-Tuning (PEFT)

Use the PEFT library with Transformers for efficient fine-tuning:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Configure LoRA
lora_config = LoraConfig(
    r=16,                                 # Low-rank dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply to model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(model, lora_config)

# Now train as usual - only the LoRA parameters are updated
trainer = Trainer(model=model, ...)
trainer.train()
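
What LoRA does mathematically: the frozen weight `W` is augmented with a trainable low-rank product scaled by `lora_alpha / r`. A pure-Python sketch with toy dimensions (PEFT applies this inside each targeted projection layer):

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha=32, r=2):
    """y = W @ x + (lora_alpha / r) * B @ (A @ x); only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (lora_alpha / r) * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight
A = [[0.1, 0.0], [0.0, 0.1]]   # r x d_in down-projection
B = [[0.0, 0.0], [0.0, 0.0]]   # d_out x r up-projection, initialized to zero
y = lora_forward(W, A, B, [1.0, 2.0])
```

Initializing `B` to zero is why a freshly added LoRA adapter leaves the model's behavior unchanged before any training happens.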

### Chat Templates

Apply chat templates for instruction-tuned models:

```python
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
]

# Format according to the model's chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and generate
inputs = tokenizer(formatted, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)
response = tokenizer.decode(outputs[0])
```
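
Chat templates exist because each model family wraps messages in different control tokens. As a rough illustration, here is how a ChatML-style template renders a conversation (a format used by some models; Llama 2 uses a different `[INST]`-based format, which is exactly why you should rely on `apply_chat_template` instead of formatting by hand):

```python
def chatml_format(messages, add_generation_prompt=True):
    """Roughly how a ChatML-style template renders a conversation."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"  # cue the model to answer
    return text

prompt = chatml_format([{"role": "user", "content": "What is machine learning?"}])
```

`add_generation_prompt=True` appends the opening of an assistant turn, which is what makes the model continue with an answer rather than another user message.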

### Multi-GPU Training

```python
# Automatic with Trainer - no code changes needed
# Just run with: accelerate launch train.py

# Or configure PyTorch DDP explicitly
training_args = TrainingArguments(
    output_dir="./results",
    ddp_find_unused_parameters=False,
    # ... other args
)

# For larger models, use FSDP
training_args = TrainingArguments(
    output_dir="./results",
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "fsdp_transformer_layer_cls_to_wrap": ["BertLayer"],
    },
)
```

## Task-Specific Patterns

### Question Answering (Extractive)

```python
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What is extractive QA?",
    context="Extractive QA extracts the answer from the given context..."
)
# {'answer': 'extracts the answer from the given context', 'score': 0.97, ...}
```

### Named Entity Recognition

```python
# aggregation_strategy="simple" merges sub-tokens into whole entities
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple"
)

result = ner("My name is John and I live in New York")
# [{'entity_group': 'PER', 'word': 'John', ...},
#  {'entity_group': 'LOC', 'word': 'New York', ...}]
```

### Image Captioning

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

image = Image.open("image.jpg")

inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0]
```

### Speech Recognition

```python
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base"
)

result = transcriber("audio.mp3")
# {'text': 'This is the transcribed text...'}

# With timestamps
result = transcriber("audio.mp3", return_timestamps=True)
```

## Common Patterns and Best Practices

### Saving and Loading Models

```python
# Save model and tokenizer
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")

# Load later
model = AutoModel.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")

# Push to the Hugging Face Hub
model.push_to_hub("username/my-model")
tokenizer.push_to_hub("username/my-model")

# Load from the Hub
model = AutoModel.from_pretrained("username/my-model")
```

### Error Handling

```python
from transformers import AutoModel
import torch

try:
    model = AutoModel.from_pretrained("model-name")
except OSError:
    print("Model not found - check internet connection or model name")
except torch.cuda.OutOfMemoryError:
    print("GPU memory exceeded - try quantization or a smaller batch size")
```

### Device Management

```python
import torch

# Check device availability
device = "cuda" if torch.cuda.is_available() else "cpu"

# Move model to device
model = model.to(device)

# Or use device_map for automatic distribution
model = AutoModel.from_pretrained("model-name", device_map="auto")

# Move inputs to the same device
inputs = tokenizer(text, return_tensors="pt").to(device)
```

### Memory Management

```python
import torch

# Clear the CUDA cache
torch.cuda.empty_cache()

# Disable gradient tracking for inference
with torch.no_grad():
    outputs = model(**inputs)

# Delete unused models
del model
torch.cuda.empty_cache()
```

## Resources

This skill includes comprehensive reference documentation and example scripts:

### scripts/

- `quick_inference.py`: Ready-to-use script for running inference with pipelines
- `fine_tune_classifier.py`: Complete example for fine-tuning a text classifier
- `generate_text.py`: Text generation with various strategies

Execute scripts directly or read them as implementation templates.

### references/

- `api_reference.md`: Comprehensive API documentation for key classes
- `training_guide.md`: Detailed training patterns, optimization, and troubleshooting
- `generation_strategies.md`: In-depth guide to text generation methods
- `quantization.md`: Model quantization techniques comparison and usage
- `task_patterns.md`: Quick reference for common task implementations

Load reference files when you need detailed information on specific topics. References contain extensive examples, parameter explanations, and best practices.

## Troubleshooting

**Import errors:**
```bash
pip install transformers
pip install accelerate     # For device_map="auto"
pip install bitsandbytes   # For quantization
```

**CUDA out of memory:**
- Reduce batch size
- Enable gradient checkpointing
- Use gradient accumulation
- Try quantization (8-bit or 4-bit)
- Use a smaller model variant

**Slow training:**
- Enable mixed precision (fp16/bf16)
- Increase batch size (if memory allows)
- Use torch.compile (PyTorch 2.0+)
- Check that data loading isn't the bottleneck

**Poor generation quality:**
- Adjust temperature (lower = more focused)
- Try different decoding strategies (beam search vs. sampling)
- Increase max_length if outputs are cut off
- Use repetition_penalty to reduce repetition

For task-specific guidance, consult the appropriate reference file in the `references/` directory.