name: transformers
description: Hugging Face Transformers. Load BERT, GPT, T5, ViT, CLIP, Llama models, fine-tune, text generation, classification, NER, pipelines, LoRA, for NLP/vision/audio tasks.

Transformers

Overview

Transformers is Hugging Face's flagship library providing unified access to over 1 million pretrained models for machine learning across text, vision, audio, and multimodal domains. The library serves as a standardized model-definition framework compatible with PyTorch, TensorFlow, and JAX, emphasizing ease of use through three core components:

  • Pipeline: Simple, optimized inference API for common tasks
  • AutoClasses: Automatic model/tokenizer selection from pretrained checkpoints
  • Trainer: Full-featured training loop with distributed training, mixed precision, and optimization

The library prioritizes accessibility: pretrained models reduce computational cost and carbon footprint, and the same model definitions plug into major training and inference frameworks (PyTorch Lightning, DeepSpeed, vLLM, etc.).
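
As a minimal sketch of how the three entry points relate (the checkpoint name is only an example; each component is covered in detail below):

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, Trainer

# Pipeline: one-line inference
sentiment = pipeline("text-classification")

# AutoClasses: explicit model/tokenizer loading from a checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Trainer: full training loop (datasets and arguments shown in later sections)
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)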

Quick Start with Pipelines

Use pipelines for simple, efficient inference without managing models, tokenizers, or preprocessing manually. Pipelines abstract complexity into a single function call.

Basic Pipeline Usage

from transformers import pipeline

# Text classification
classifier = pipeline("text-classification")
result = classifier("This restaurant is awesome")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
generator("The secret to baking a good cake is", max_length=50)

# Question answering
qa = pipeline("question-answering")
qa(question="What is extractive QA?", context="Extractive QA is...")

# Image classification
img_classifier = pipeline("image-classification")
img_classifier("path/to/image.jpg")

# Automatic speech recognition
transcriber = pipeline("automatic-speech-recognition")
transcriber("audio_file.mp3")

Available Pipeline Tasks

NLP Tasks:

  • text-classification, token-classification, question-answering
  • fill-mask, summarization, translation
  • text-generation, conversational
  • zero-shot-classification, sentiment-analysis

Vision Tasks:

  • image-classification, image-segmentation, object-detection
  • depth-estimation, image-to-image, zero-shot-image-classification

Audio Tasks:

  • automatic-speech-recognition, audio-classification
  • text-to-audio, zero-shot-audio-classification

Multimodal Tasks:

  • visual-question-answering, document-question-answering
  • image-to-text, zero-shot-object-detection
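
Any of these task strings can be passed directly to pipeline(). As a minimal sketch, zero-shot classification (with no model specified, the pipeline falls back to a default checkpoint):

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "The new GPU doubles training throughput.",
    candidate_labels=["technology", "sports", "politics"],
)
# {'sequence': ..., 'labels': ['technology', ...], 'scores': [...]}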

Pipeline Best Practices

Device Management:

from transformers import pipeline, infer_device

device = infer_device()  # Auto-detect best device
pipe = pipeline("text-generation", model="...", device=device)

Batch Processing:

# Process multiple inputs efficiently
results = classifier(["Text 1", "Text 2", "Text 3"])

# Use KeyDataset for large datasets
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset

dataset = load_dataset("imdb", split="test")
for result in pipe(KeyDataset(dataset, "text")):
    print(result)

Memory Optimization:

import torch

# Use half-precision for faster inference
pipe = pipeline("text-generation", model="...",
                torch_dtype=torch.float16, device="cuda")

Core Components

AutoClasses for Model Loading

AutoClasses automatically select the correct architecture based on pretrained checkpoints.

from transformers import (
    AutoModel, AutoTokenizer, AutoConfig,
    AutoModelForCausalLM, AutoModelForSequenceClassification
)

# Load any model by checkpoint name
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Task-specific model classes
causal_lm = AutoModelForCausalLM.from_pretrained("gpt2")
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)

# Load with device and dtype optimization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",      # Automatically distribute across devices
    torch_dtype="auto"      # Use optimal dtype
)

Key Parameters:

  • device_map="auto": Optimal device allocation (CPU/GPU/multi-GPU)
  • torch_dtype: Control precision (torch.float16, torch.bfloat16, "auto")
  • trust_remote_code: Enable custom model code (use cautiously)
  • use_fast: Enable Rust-backed fast tokenizers (default True)
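
AutoConfig (imported above) lets you inspect or override a checkpoint's configuration before loading; a minimal sketch:

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.hidden_size, config.num_hidden_layers)

# Override settings, then load a model that uses the modified config
config.num_labels = 3
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", config=config
)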

Tokenization

Tokenizers convert text to model-compatible tensor inputs.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization
tokens = tokenizer.tokenize("Hello, how are you?")
# ['hello', ',', 'how', 'are', 'you', '?']

# Encoding (text → token IDs)
encoded = tokenizer("Hello, how are you?", return_tensors="pt")
# {'input_ids': tensor([[...]]), 'attention_mask': tensor([[...]])}

# Batch encoding with padding and truncation
batch = tokenizer(
    ["Short text", "This is a much longer text..."],
    padding=True,           # Pad to longest in batch
    truncation=True,        # Truncate to model's max length
    max_length=512,
    return_tensors="pt"
)

# Decoding (token IDs → text)
text = tokenizer.decode(encoded['input_ids'][0])

Special Tokens:

# Access special tokens
tokenizer.pad_token      # Padding token
tokenizer.cls_token      # Classification token
tokenizer.sep_token      # Separator token
tokenizer.mask_token     # Mask token (for MLM)

# Add custom tokens
tokenizer.add_tokens(["[CUSTOM]"])
tokenizer.add_special_tokens({'additional_special_tokens': ['[NEW]']})

# Resize model embeddings to match new vocabulary
model.resize_token_embeddings(len(tokenizer))

Image Processors

For vision tasks, use image processors instead of tokenizers.

from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Process single image
from PIL import Image
image = Image.open("path/to/image.jpg")
inputs = processor(image, return_tensors="pt")
# Returns: {'pixel_values': tensor([[...]])}

# Batch processing
images = [Image.open(f"img{i}.jpg") for i in range(3)]
inputs = processor(images, return_tensors="pt")

Processors for Multimodal Models

Multimodal models use processors that combine image and text processing.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/git-base")

# Process image + text caption
inputs = processor(
    images=image,
    text="A description of the image",
    return_tensors="pt",
    padding=True
)

Model Inference

Basic Inference Pattern

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize input
inputs = tokenizer("The future of AI is", return_tensors="pt")

# Generate (for causal LM)
outputs = model.generate(**inputs, max_length=50)
text = tokenizer.decode(outputs[0])

# Or get model outputs directly
outputs = model(**inputs)
logits = outputs.logits  # Shape: (batch_size, seq_len, vocab_size)

Text Generation Strategies

For generative models, control generation behavior with parameters:

# Greedy decoding (default)
output = model.generate(**inputs, max_length=50)

# Beam search (multiple hypotheses)
output = model.generate(
    **inputs,
    max_length=50,
    num_beams=5,           # Keep top 5 beams
    early_stopping=True
)

# Sampling with temperature
output = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    temperature=0.7,       # Lower = more focused, higher = more random
    top_k=50,              # Sample from top 50 tokens
    top_p=0.95             # Nucleus sampling
)

# Streaming generation
from transformers import TextStreamer

streamer = TextStreamer(tokenizer)
model.generate(**inputs, streamer=streamer, max_length=100)

Generation Parameters:

  • max_length / max_new_tokens: Control output length
  • num_beams: Beam search width (1 = greedy)
  • temperature: Randomness (0.7-1.0 typical)
  • top_k: Sample from top k tokens
  • top_p: Nucleus sampling threshold
  • repetition_penalty: Discourage repetition (>1.0)
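
These parameters can also be bundled into a reusable GenerationConfig instead of being passed on every call; a minimal sketch:

from transformers import GenerationConfig

gen_config = GenerationConfig(
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.1,
)
output = model.generate(**inputs, generation_config=gen_config)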

Refer to references/generation_strategies.md for detailed information on choosing appropriate strategies.

Training and Fine-Tuning

Training Workflow Overview

  1. Load dataset → 2. Preprocess → 3. Configure training → 4. Train → 5. Evaluate → 6. Save/Share

Text Classification Example

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import load_dataset

# 1. Load dataset
dataset = load_dataset("imdb")

# 2. Preprocess
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized = dataset.map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 3. Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1}
)

# 4. Configure training
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

# 5. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

# 6. Evaluate and save
metrics = trainer.evaluate()
trainer.save_model("./my-finetuned-model")
trainer.push_to_hub()  # Share to Hugging Face Hub
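
The Trainer above only reports loss during evaluation. To track accuracy (required if you set metric_for_best_model="accuracy"), pass a compute_metrics function; a minimal sketch using plain NumPy:

import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Pass it when constructing the Trainer: Trainer(..., compute_metrics=compute_metrics)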

Vision Task Fine-Tuning

from transformers import (
    AutoImageProcessor, AutoModelForImageClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset

# Load dataset
dataset = load_dataset("food101", split="train[:5000]")

# Image preprocessing
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

def transform(examples):
    examples["pixel_values"] = [
        processor(img.convert("RGB"), return_tensors="pt")["pixel_values"][0]
        for img in examples["image"]
    ]
    return examples

dataset = dataset.with_transform(transform)

# Load model
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=101,  # 101 food categories
    ignore_mismatched_sizes=True
)

# Training (similar pattern to text)
training_args = TrainingArguments(
    output_dir="./vit-food101",
    remove_unused_columns=False,  # Keep image data
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor,
)

trainer.train()
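
The default data collator expects tensors it can stack directly; for image datasets it is common to pass a small collate function explicitly. A minimal sketch, assuming the dataset's label column is named "label":

import torch

def collate_fn(batch):
    return {
        "pixel_values": torch.stack([item["pixel_values"] for item in batch]),
        "labels": torch.tensor([item["label"] for item in batch]),
    }

# Pass it when constructing the Trainer: Trainer(..., data_collator=collate_fn)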

Sequence-to-Sequence Tasks

For tasks such as summarization and translation, use Seq2SeqTrainer:

from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def preprocess(examples):
    # Prefix input for T5
    inputs = ["summarize: " + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize targets
    labels = tokenizer(
        examples["summary"],
        max_length=128,
        truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-summarization",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,  # Important for seq2seq
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
)

trainer.train()

Important TrainingArguments

TrainingArguments(
    # Essential
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,

    # Evaluation
    eval_strategy="epoch",        # or "steps"
    eval_steps=500,               # if eval_strategy="steps"

    # Checkpointing
    save_strategy="epoch",
    save_steps=500,
    save_total_limit=2,           # Keep only 2 best checkpoints
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",

    # Optimization
    gradient_accumulation_steps=4,
    warmup_steps=500,
    weight_decay=0.01,
    max_grad_norm=1.0,

    # Mixed Precision (enable one of these, not both)
    # fp16=True,                  # For older Nvidia GPUs
    bf16=True,                    # For Ampere+ GPUs (preferred)

    # Logging
    logging_steps=100,
    report_to="tensorboard",      # or "wandb", "mlflow"

    # Memory Optimization
    gradient_checkpointing=True,
    optim="adamw_torch",          # or "adafactor" for memory

    # Distributed Training
    ddp_find_unused_parameters=False,
)

Refer to references/training_guide.md for comprehensive training patterns and optimization strategies.

Performance Optimization

Model Quantization

Reduce memory footprint while maintaining accuracy:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# 4-bit quantization (even smaller)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

Quantization Methods:

  • Bitsandbytes: 4/8-bit on-the-fly quantization, supports PEFT fine-tuning
  • GPTQ: 2/3/4/8-bit, requires calibration, very fast inference
  • AWQ: 4-bit activation-aware, balanced speed/accuracy
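
Pre-quantized GPTQ/AWQ checkpoints published on the Hub load through the usual from_pretrained call, provided the matching backend package (e.g., auto-gptq/optimum or autoawq) is installed; the model id below is only an example:

quantized_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",   # example pre-quantized checkpoint
    device_map="auto",
)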

Refer to references/quantization.md for detailed comparison and usage patterns.

Training Optimization

# Gradient accumulation (simulate larger batch)
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # Effective batch = 4 * 8 = 32
)

# Gradient checkpointing (reduce memory, slower)
training_args = TrainingArguments(
    gradient_checkpointing=True,
)

# Mixed precision training
training_args = TrainingArguments(
    bf16=True,  # or fp16=True
)

# Efficient optimizer
training_args = TrainingArguments(
    optim="adafactor",  # Lower memory than AdamW
)

Key Strategies:

  • Batch sizes: Use powers of 2 (8, 16, 32, 64, 128)
  • Gradient accumulation: Enables larger effective batch sizes
  • Gradient checkpointing: Reduces memory ~60%, increases time ~20%
  • Mixed precision: bf16 for Ampere+ GPUs, fp16 for older
  • torch.compile: Optimize model graph (PyTorch 2.0+)
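
torch.compile can be applied directly to a model for inference or enabled through TrainingArguments for training; a minimal sketch:

import torch

# Compile the model graph for faster execution (PyTorch 2.0+)
model = torch.compile(model)

# Or let the Trainer handle it
training_args = TrainingArguments(
    output_dir="./results",
    torch_compile=True,
)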

Advanced Features

Custom Training Loop

For maximum control, bypass Trainer:

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler

# Prepare data
train_dataloader = DataLoader(tokenized_dataset, batch_size=8, shuffle=True)

# Setup optimizer and scheduler
num_epochs = 3
optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=len(train_dataloader) * num_epochs
)

# Training loop
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

Parameter-Efficient Fine-Tuning (PEFT)

Use PEFT library with transformers for efficient fine-tuning:

from peft import LoraConfig, get_peft_model

# Configure LoRA
lora_config = LoraConfig(
    r=16,                   # Low-rank dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply to model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(model, lora_config)

# Now train as usual - only LoRA parameters train
trainer = Trainer(model=model, ...)
trainer.train()
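
To verify how few parameters are trainable and to save or reload only the adapter weights, a minimal sketch using standard PEFT APIs:

model.print_trainable_parameters()
# e.g., trainable params are a small fraction of the full model

# Save just the LoRA adapter (typically a few MB)
model.save_pretrained("./llama2-lora-adapter")

# Later: reattach the adapter to the base model
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./llama2-lora-adapter")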

Chat Templates

Apply chat templates for instruction-tuned models:

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
]

# Format according to model's chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and generate
inputs = tokenizer(formatted, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)
response = tokenizer.decode(outputs[0])
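
apply_chat_template can also tokenize in the same call and return tensors ready for generation:

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)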

Multi-GPU Training

# Automatic with Trainer - no code changes needed
# Just run with: accelerate launch train.py

# Or use PyTorch DDP explicitly
training_args = TrainingArguments(
    output_dir="./results",
    ddp_find_unused_parameters=False,
    # ... other args
)

# For larger models, use FSDP
training_args = TrainingArguments(
    output_dir="./results",
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "fsdp_transformer_layer_cls_to_wrap": ["BertLayer"],
    },
)

Task-Specific Patterns

Question Answering (Extractive)

from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What is extractive QA?",
    context="Extractive QA extracts the answer from the given context..."
)
# {'answer': 'extracts the answer from the given context', 'score': 0.97, ...}

Named Entity Recognition

ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")  # Group sub-tokens into full entities

result = ner("My name is John and I live in New York")
# [{'entity_group': 'PER', 'word': 'John', ...}, {'entity_group': 'LOC', 'word': 'New York', ...}]

Image Captioning

from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

from PIL import Image
image = Image.open("image.jpg")

inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0]

Speech Recognition

transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base"
)

result = transcriber("audio.mp3")
# {'text': 'This is the transcribed text...'}

# With timestamps
result = transcriber("audio.mp3", return_timestamps=True)

Common Patterns and Best Practices

Saving and Loading Models

# Save entire model
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")

# Load later
model = AutoModel.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")

# Push to Hugging Face Hub
model.push_to_hub("username/my-model")
tokenizer.push_to_hub("username/my-model")

# Load from Hub
model = AutoModel.from_pretrained("username/my-model")

Error Handling

from transformers import AutoModel
import torch

try:
    model = AutoModel.from_pretrained("model-name")
except OSError:
    print("Model not found - check internet connection or model name")
except torch.cuda.OutOfMemoryError:
    print("GPU memory exceeded - try quantization or smaller batch size")

Device Management

import torch

# Check device availability
device = "cuda" if torch.cuda.is_available() else "cpu"

# Move model to device
model = model.to(device)

# Or use device_map for automatic distribution
model = AutoModel.from_pretrained("model-name", device_map="auto")

# For inputs
inputs = tokenizer(text, return_tensors="pt").to(device)

Memory Management

import torch

# Clear CUDA cache
torch.cuda.empty_cache()

# Use context manager for inference
with torch.no_grad():
    outputs = model(**inputs)

# Delete unused models
del model
torch.cuda.empty_cache()

Resources

This skill includes comprehensive reference documentation and example scripts:

scripts/

  • quick_inference.py: Ready-to-use script for running inference with pipelines
  • fine_tune_classifier.py: Complete example for fine-tuning a text classifier
  • generate_text.py: Text generation with various strategies

Execute scripts directly or read them as implementation templates.

references/

  • api_reference.md: Comprehensive API documentation for key classes
  • training_guide.md: Detailed training patterns, optimization, and troubleshooting
  • generation_strategies.md: In-depth guide to text generation methods
  • quantization.md: Model quantization techniques comparison and usage
  • task_patterns.md: Quick reference for common task implementations

Load reference files when you need detailed information on specific topics. References contain extensive examples, parameter explanations, and best practices.

Troubleshooting

Import errors:

pip install transformers
pip install accelerate  # For device_map="auto"
pip install bitsandbytes  # For quantization

CUDA out of memory:

  • Reduce batch size
  • Enable gradient checkpointing
  • Use gradient accumulation
  • Try quantization (8-bit or 4-bit)
  • Use smaller model variant
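
These levers can be combined in a single configuration; a minimal sketch with illustrative values:

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,    # smaller batch
    gradient_accumulation_steps=16,   # preserve effective batch size
    gradient_checkpointing=True,      # trade compute for memory
    bf16=True,                        # mixed precision
    optim="adafactor",                # lower-memory optimizer
)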

Slow training:

  • Enable mixed precision (fp16/bf16)
  • Increase batch size (if memory allows)
  • Use torch.compile (PyTorch 2.0+)
  • Check data loading isn't bottleneck

Poor generation quality:

  • Adjust temperature (lower = more focused)
  • Try different decoding strategies (beam search vs sampling)
  • Increase max_length if outputs cut off
  • Use repetition_penalty to reduce repetition

For task-specific guidance, consult the appropriate reference file in the references/ directory.