Improve the Hugging Face transformers skill

This commit is contained in:
Timothy Kassis
2025-11-03 16:44:15 -08:00
parent 86d8878eeb
commit c56fa43747
12 changed files with 2041 additions and 2705 deletions


@@ -1,349 +1,157 @@
---
name: transformers
description: Work with state-of-the-art machine learning models for NLP, computer vision, audio, and multimodal tasks using HuggingFace Transformers. This skill should be used when fine-tuning pre-trained models, performing inference with pipelines, generating text, training sequence models, or working with BERT, GPT, T5, ViT, and other transformer architectures. Covers model loading, tokenization, training with Trainer API, text generation strategies, and task-specific patterns for classification, NER, QA, summarization, translation, and image tasks. (plugin:scientific-packages@claude-scientific-skills)
description: This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.
---
# Transformers
## Overview
The Transformers library provides state-of-the-art machine learning models for NLP, computer vision, audio, and multimodal tasks. Apply this skill for quick inference through pipelines, comprehensive training via the Trainer API, and flexible text generation with various decoding strategies.
The Hugging Face Transformers library provides access to thousands of pre-trained models for tasks across NLP, computer vision, audio, and multimodal domains. Use this skill to load models, perform inference, and fine-tune on custom data.
## Core Capabilities ## Installation
### 1. Quick Inference with Pipelines Install transformers and core dependencies:
For rapid inference without complex setup, use the `pipeline()` API. Pipelines abstract away tokenization, model invocation, and post-processing. ```bash
uv pip install torch transformers datasets evaluate accelerate
```
For vision tasks, add:
```bash
uv pip install timm pillow
```
For audio tasks, add:
```bash
uv pip install librosa soundfile
```
## Authentication
Many models on the Hugging Face Hub require authentication. Set up access:
```python
from huggingface_hub import login
login() # Follow prompts to enter token
```
Or set environment variable:
```bash
export HF_TOKEN="your_token_here"
```
Get tokens at: https://huggingface.co/settings/tokens
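The token can also be passed explicitly; a minimal sketch, assuming the token was exported under the `HF_TOKEN` name shown above:
```python
import os
from huggingface_hub import login

# Non-interactive login; assumes the token is available as HF_TOKEN
login(token=os.environ["HF_TOKEN"])
```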
## Quick Start
Use the Pipeline API for fast inference without manual configuration:
```python
from transformers import pipeline

# Text generation
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=50)

# Text classification
classifier = pipeline("text-classification")
result = classifier("This movie was excellent!")

# Named entity recognition
ner = pipeline("token-classification")
entities = ner("Sarah works at Microsoft in Seattle")

# Question answering
qa = pipeline("question-answering")
answer = qa(question="What is the capital?", context="Paris is the capital of France.")

# Image classification
image_classifier = pipeline("image-classification")
predictions = image_classifier("image.jpg")
```
**When to use pipelines:** ## Core Capabilities
- Quick prototyping and testing
- Simple inference tasks without custom logic
- Demonstrations and examples
- Production inference for standard tasks
**Available pipeline tasks:** ### 1. Pipelines for Quick Inference
- **NLP**: text-classification, token-classification, question-answering, summarization, translation, text-generation, fill-mask, zero-shot-classification
- **Vision**: image-classification, object-detection, image-segmentation, depth-estimation, zero-shot-image-classification
- **Audio**: automatic-speech-recognition, audio-classification, text-to-audio
- **Multimodal**: image-to-text, visual-question-answering, image-text-to-text
For comprehensive pipeline documentation, see `references/pipelines.md`. Use for simple, optimized inference across many tasks. Supports text generation, classification, NER, question answering, summarization, translation, image classification, object detection, audio classification, and more.
### 2. Model Training and Fine-Tuning **When to use**: Quick prototyping, simple inference tasks, no custom preprocessing needed.
Use the Trainer API for comprehensive model training with support for distributed training, mixed precision, and advanced optimization. See `references/pipelines.md` for comprehensive task coverage and optimization.
**Basic training workflow:** ### 2. Model Loading and Management
Load pre-trained models with fine-grained control over configuration, device placement, and precision.
**When to use**: Custom model initialization, advanced device management, model inspection.
See `references/models.md` for loading patterns and best practices.
### 3. Text Generation
Generate text with LLMs using various decoding strategies (greedy, beam search, sampling) and control parameters (temperature, top-k, top-p).
**When to use**: Creative text generation, code generation, conversational AI, text completion.
See `references/generation.md` for generation strategies and parameters.
### 4. Training and Fine-Tuning
Fine-tune pre-trained models on custom datasets using the Trainer API with automatic mixed precision, distributed training, and logging.
**When to use**: Task-specific model adaptation, domain adaptation, improving model performance.
See `references/training.md` for training workflows and best practices.
### 5. Tokenization
Convert text to tokens and token IDs for model input, with padding, truncation, and special token handling.
**When to use**: Custom preprocessing pipelines, understanding model inputs, batch processing.
See `references/tokenizers.md` for tokenization details.
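A minimal tokenization sketch (the model name is illustrative) showing padding, truncation, and tensor output:
```python
from transformers import AutoTokenizer

# Batch tokenization with padding and truncation
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["A short sentence.", "A much longer sentence that will likely be truncated."],
    padding=True,
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (batch_size, sequence_length)
```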
## Common Patterns
### Pattern 1: Simple Inference
For straightforward tasks, use pipelines:
```python ```python
from transformers import ( pipe = pipeline("task-name", model="model-id")
AutoTokenizer, output = pipe(input_data)
AutoModelForSequenceClassification, ```
TrainingArguments,
Trainer
)
from datasets import load_dataset
# 1. Load and tokenize data ### Pattern 2: Custom Model Usage
dataset = load_dataset("imdb") For advanced control, load model and tokenizer separately:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ```python
from transformers import AutoModelForCausalLM, AutoTokenizer
def tokenize_function(examples): tokenizer = AutoTokenizer.from_pretrained("model-id")
return tokenizer(examples["text"], padding="max_length", truncation=True) model = AutoModelForCausalLM.from_pretrained("model-id", device_map="auto")
tokenized_datasets = dataset.map(tokenize_function, batched=True) inputs = tokenizer("text", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
result = tokenizer.decode(outputs[0])
```
# 2. Load model ### Pattern 3: Fine-Tuning
model = AutoModelForSequenceClassification.from_pretrained( For task adaptation, use Trainer:
"bert-base-uncased", ```python
num_labels=2 from transformers import Trainer, TrainingArguments
)
# 3. Configure training
training_args = TrainingArguments( training_args = TrainingArguments(
output_dir="./results", output_dir="./results",
num_train_epochs=3, num_train_epochs=3,
per_device_train_batch_size=16, per_device_train_batch_size=8,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
) )
# 4. Create trainer and train
trainer = Trainer( trainer = Trainer(
model=model, model=model,
args=training_args, args=training_args,
train_dataset=tokenized_datasets["train"], train_dataset=train_dataset,
eval_dataset=tokenized_datasets["test"],
) )
trainer.train() trainer.train()
``` ```
**Key training features:** ## Reference Documentation
- Mixed precision training (fp16/bf16)
- Distributed training (multi-GPU, multi-node)
- Gradient accumulation
- Learning rate scheduling with warmup
- Checkpoint management
- Hyperparameter search
- Push to Hugging Face Hub
For detailed training documentation, see `references/training.md`. For detailed information on specific components:
- **Pipelines**: `references/pipelines.md` - All supported tasks and optimization
### 3. Text Generation - **Models**: `references/models.md` - Loading, saving, and configuration
- **Generation**: `references/generation.md` - Text generation strategies and parameters
Generate text using various decoding strategies including greedy decoding, beam search, sampling, and more. - **Training**: `references/training.md` - Fine-tuning with Trainer API
- **Tokenizers**: `references/tokenizers.md` - Tokenization and preprocessing
**Generation strategies:**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")
# Greedy decoding (deterministic)
outputs = model.generate(**inputs, max_new_tokens=50)
# Beam search (explores multiple hypotheses)
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
early_stopping=True
)
# Sampling (creative, diverse)
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50
)
```
**Generation parameters:**
- `temperature`: Controls randomness (0.1-2.0)
- `top_k`: Sample from top-k tokens
- `top_p`: Nucleus sampling threshold
- `num_beams`: Number of beams for beam search
- `repetition_penalty`: Discourage repetition
- `no_repeat_ngram_size`: Prevent repeating n-grams
For comprehensive generation documentation, see `references/generation_strategies.md`.
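A short sketch combining the repetition-control parameters above with sampling:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")

# Sampling with penalties that discourage repeated tokens and n-grams
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```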
### 4. Task-Specific Patterns
Common task patterns with appropriate model classes:
**Text Classification:**
```python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=3,
id2label={0: "negative", 1: "neutral", 2: "positive"}
)
```
**Named Entity Recognition (Token Classification):**
```python
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=9 # Number of entity types
)
```
**Question Answering:**
```python
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
```
**Summarization and Translation (Seq2Seq):**
```python
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
```
**Image Classification:**
```python
from transformers import AutoModelForImageClassification
model = AutoModelForImageClassification.from_pretrained(
"google/vit-base-patch16-224",
num_labels=num_classes
)
```
For detailed task-specific workflows including data preprocessing, training, and evaluation, see `references/task_patterns.md`.
## Auto Classes
Use Auto classes for automatic architecture selection based on model checkpoints:
```python
from transformers import (
AutoTokenizer, # Tokenization
AutoModel, # Base model (hidden states)
AutoModelForSequenceClassification,
AutoModelForTokenClassification,
AutoModelForQuestionAnswering,
AutoModelForCausalLM, # GPT-style
AutoModelForMaskedLM, # BERT-style
AutoModelForSeq2SeqLM, # T5, BART
AutoProcessor, # For multimodal models
AutoImageProcessor, # For vision models
)
# Load any model by name
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
```
For comprehensive API documentation, see `references/api_reference.md`.
## Model Loading and Optimization
**Device placement:**
```python
model = AutoModel.from_pretrained("bert-base-uncased", device_map="auto")
```
**Mixed precision:**
```python
model = AutoModel.from_pretrained(
"model-name",
torch_dtype=torch.float16 # or torch.bfloat16
)
```
**Quantization:**
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
```
## Common Workflows
### Quick Inference Workflow
1. Choose appropriate pipeline for task
2. Load pipeline with optional model specification
3. Pass inputs and get results
4. For batch processing, pass a list of inputs (see the sketch below)
**See:** `scripts/quick_inference.py` for comprehensive pipeline examples
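For step 4, a minimal batch-processing sketch (the task's default model is used; pass `device=0` to run on a GPU):
```python
from transformers import pipeline

classifier = pipeline("text-classification")
reviews = ["Great value for the price.", "Arrived broken and late."]
# Passing a list runs the whole batch; batch_size controls chunking
for review, prediction in zip(reviews, classifier(reviews, batch_size=8)):
    print(review, "->", prediction["label"], round(prediction["score"], 3))
```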
### Training Workflow
1. Load and preprocess dataset using 🤗 Datasets
2. Tokenize data with appropriate tokenizer
3. Load pre-trained model for specific task
4. Configure TrainingArguments
5. Create Trainer with model, data, and a `compute_metrics` function (see the sketch below)
6. Train with `trainer.train()`
7. Evaluate with `trainer.evaluate()`
8. Save model and optionally push to Hub
**See:** `scripts/fine_tune_classifier.py` for complete training example
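For step 5, a minimal `compute_metrics` sketch, assuming single-label classification:
```python
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is the (logits, labels) pair supplied by Trainer during evaluation
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
```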
### Text Generation Workflow
1. Load causal or seq2seq language model
2. Load tokenizer and tokenize prompt
3. Choose generation strategy (greedy, beam search, sampling)
4. Configure generation parameters
5. Generate with `model.generate()`
6. Decode output tokens to text
**See:** `scripts/generate_text.py` for generation strategy examples
## Best Practices
1. **Use Auto classes** for flexibility across different model architectures
2. **Batch processing** for efficiency - process multiple inputs at once
3. **Device management** - use `device_map="auto"` for automatic placement
4. **Memory optimization** - enable fp16/bf16 or quantization for large models
5. **Checkpoint management** - save checkpoints regularly and load best model
6. **Pipeline for quick tasks** - use pipelines for standard inference tasks
7. **Custom metrics** - define compute_metrics for task-specific evaluation
8. **Gradient accumulation** - use for large effective batch sizes on limited memory (see the sketch after this list)
9. **Learning rate warmup** - typically 5-10% of total training steps
10. **Hub integration** - push trained models to Hub for sharing and versioning
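A minimal sketch tying together practices 4, 5, 8, and 9; the specific values are illustrative assumptions:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size of 32 per device
    warmup_ratio=0.06,              # roughly 5-10% of steps spent warming up
    fp16=True,                      # mixed precision on supported GPUs
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
```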
## Resources
### scripts/
Executable Python scripts demonstrating common Transformers workflows:
- `quick_inference.py` - Pipeline examples for NLP, vision, audio, and multimodal tasks
- `fine_tune_classifier.py` - Complete fine-tuning workflow with Trainer API
- `generate_text.py` - Text generation with various decoding strategies
Run scripts directly to see examples in action:
```bash
python scripts/quick_inference.py
python scripts/fine_tune_classifier.py
python scripts/generate_text.py
```
### references/
Comprehensive reference documentation loaded into context as needed:
- `api_reference.md` - Core classes and APIs (Auto classes, Trainer, GenerationConfig, etc.)
- `pipelines.md` - All available pipelines organized by modality with examples
- `training.md` - Training patterns, TrainingArguments, distributed training, callbacks
- `generation_strategies.md` - Text generation methods, decoding strategies, parameters
- `task_patterns.md` - Complete workflows for common tasks (classification, NER, QA, summarization, etc.)
When working on specific tasks or features, load the relevant reference file for detailed guidance.
## Additional Information
- **Official Documentation**: https://huggingface.co/docs/transformers/index
- **Model Hub**: https://huggingface.co/models (1M+ pre-trained models)
- **Datasets Hub**: https://huggingface.co/datasets
- **Installation**: `pip install transformers datasets evaluate accelerate`
- **GPU Support**: Requires PyTorch or TensorFlow with CUDA
- **Framework Support**: PyTorch (primary), TensorFlow, JAX/Flax


@@ -1,485 +0,0 @@
# Transformers API Reference
This reference covers the core classes and APIs in the Transformers library.
## Core Auto Classes
Auto classes provide a convenient way to automatically select the appropriate architecture based on model name or checkpoint.
### AutoTokenizer
```python
from transformers import AutoTokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Tokenize single text
encoded = tokenizer("Hello, how are you?")
# Returns: {'input_ids': [...], 'attention_mask': [...]}
# Tokenize with options
encoded = tokenizer(
"Hello, how are you?",
padding="max_length",
truncation=True,
max_length=512,
return_tensors="pt" # "pt" for PyTorch, "tf" for TensorFlow
)
# Tokenize pairs (for classification, QA, etc.)
encoded = tokenizer(
"Question or sentence A",
"Context or sentence B",
padding=True,
truncation=True
)
# Batch tokenization
texts = ["Text 1", "Text 2", "Text 3"]
encoded = tokenizer(texts, padding=True, truncation=True)
# Decode tokens back to text
text = tokenizer.decode(token_ids, skip_special_tokens=True)
# Batch decode
texts = tokenizer.batch_decode(batch_token_ids, skip_special_tokens=True)
```
**Key Parameters:**
- `padding`: "max_length", "longest", or True (pad to max in batch)
- `truncation`: True or strategy ("longest_first", "only_first", "only_second")
- `max_length`: Maximum sequence length
- `return_tensors`: "pt" (PyTorch), "tf" (TensorFlow), "np" (NumPy)
- `return_attention_mask`: Return attention masks (default True)
- `return_token_type_ids`: Return token type IDs for pairs (default True)
- `add_special_tokens`: Add special tokens like [CLS], [SEP] (default True)
**Special Properties:**
- `tokenizer.vocab_size`: Size of vocabulary
- `tokenizer.pad_token_id`: ID of padding token
- `tokenizer.eos_token_id`: ID of end-of-sequence token
- `tokenizer.bos_token_id`: ID of beginning-of-sequence token
- `tokenizer.unk_token_id`: ID of unknown token
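A small sketch using the properties above; GPT-2 style tokenizers, for example, define no pad token, so a common pattern is to reuse the EOS token:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS for padding
print(tokenizer.vocab_size, tokenizer.pad_token_id, tokenizer.eos_token_id)
```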
### AutoModel
Base model class that outputs hidden states.
```python
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
# Forward pass
outputs = model(**inputs)
# Access hidden states
last_hidden_state = outputs.last_hidden_state # [batch_size, seq_length, hidden_size]
pooler_output = outputs.pooler_output # [batch_size, hidden_size]
# Get all hidden states
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
outputs = model(**inputs)
all_hidden_states = outputs.hidden_states # Tuple of tensors
```
### Task-Specific Auto Classes
```python
from transformers import (
AutoModelForSequenceClassification,
AutoModelForTokenClassification,
AutoModelForQuestionAnswering,
AutoModelForCausalLM,
AutoModelForMaskedLM,
AutoModelForSeq2SeqLM,
AutoModelForImageClassification,
AutoModelForObjectDetection,
AutoModelForVision2Seq,
)
# Sequence classification (sentiment, topic, etc.)
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=3,
id2label={0: "negative", 1: "neutral", 2: "positive"},
label2id={"negative": 0, "neutral": 1, "positive": 2}
)
# Token classification (NER, POS tagging)
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=9 # Number of entity types
)
# Question answering
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Masked language modeling (BERT-style)
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# Sequence-to-sequence (T5, BART)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
# Image classification
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
# Object detection
model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50")
# Vision-to-text (image captioning, VQA)
model = AutoModelForVision2Seq.from_pretrained("microsoft/git-base")
```
### AutoProcessor
For multimodal models that need both text and image processing.
```python
from transformers import AutoProcessor
# For vision-language models
processor = AutoProcessor.from_pretrained("microsoft/git-base")
# Process image and text
from PIL import Image
image = Image.open("image.jpg")
inputs = processor(images=image, text="caption", return_tensors="pt")
# For audio models
processor = AutoProcessor.from_pretrained("openai/whisper-base")
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
```
### AutoImageProcessor
For vision-only models.
```python
from transformers import AutoImageProcessor
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
# Process single image
from PIL import Image
image = Image.open("image.jpg")
inputs = processor(image, return_tensors="pt")
# Batch processing
images = [Image.open(f"image{i}.jpg") for i in range(10)]
inputs = processor(images, return_tensors="pt")
```
## Model Loading Options
### from_pretrained Parameters
```python
model = AutoModel.from_pretrained(
"model-name",
# Device and precision
device_map="auto", # Automatic device placement
torch_dtype=torch.float16, # Use fp16
low_cpu_mem_usage=True, # Reduce CPU memory during loading
# Quantization
load_in_8bit=True, # 8-bit quantization
load_in_4bit=True, # 4-bit quantization
# Model configuration
num_labels=3, # For classification
id2label={...}, # Label mapping
label2id={...},
# Outputs
output_hidden_states=True,
output_attentions=True,
# Trust remote code
trust_remote_code=True, # For custom models
# Caching
cache_dir="./cache",
force_download=False,
resume_download=True,
)
```
### Quantization with BitsAndBytes
```python
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
# 4-bit quantization
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
```
## Training Components
### TrainingArguments
See `training.md` for comprehensive coverage. Key parameters:
```python
from transformers import TrainingArguments
args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
learning_rate=2e-5,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
fp16=True,
logging_steps=100,
save_total_limit=2,
)
```
### Trainer
```python
from transformers import Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
data_collator=data_collator,
callbacks=[callback1, callback2],
)
# Train
trainer.train()
# Resume from checkpoint
trainer.train(resume_from_checkpoint=True)
# Evaluate
metrics = trainer.evaluate()
# Predict
predictions = trainer.predict(test_dataset)
# Hyperparameter search
best_trial = trainer.hyperparameter_search(
direction="maximize",
backend="optuna",
n_trials=10,
)
# Save model
trainer.save_model("./final_model")
# Push to Hub
trainer.push_to_hub(commit_message="Training complete")
```
### Data Collators
```python
from transformers import (
DataCollatorWithPadding,
DataCollatorForTokenClassification,
DataCollatorForSeq2Seq,
DataCollatorForLanguageModeling,
DefaultDataCollator,
)
# For classification/regression with dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# For token classification (NER)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# For seq2seq tasks
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
# For language modeling
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=True, # True for masked LM, False for causal LM
mlm_probability=0.15
)
# Default (no special handling)
data_collator = DefaultDataCollator()
```
## Generation Components
### GenerationConfig
See `generation_strategies.md` for comprehensive coverage.
```python
from transformers import GenerationConfig
config = GenerationConfig(
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50,
num_beams=5,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
)
# Use with model
outputs = model.generate(**inputs, generation_config=config)
```
### generate() Method
```python
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
num_return_sequences=3,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
)
```
## Pipeline API
See `pipelines.md` for comprehensive coverage.
```python
from transformers import pipeline
# Basic usage
pipe = pipeline("task-name", model="model-name", device=0)
results = pipe(inputs)
# With custom model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("model-name")
tokenizer = AutoTokenizer.from_pretrained("model-name")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
```
## Configuration Classes
### Model Configuration
```python
from transformers import AutoConfig
# Load configuration
config = AutoConfig.from_pretrained("bert-base-uncased")
# Access configuration
print(config.hidden_size)
print(config.num_attention_heads)
print(config.num_hidden_layers)
# Modify configuration
config.num_labels = 5
config.output_hidden_states = True
# Create model with config
model = AutoModel.from_config(config)
# Save configuration
config.save_pretrained("./config")
```
## Utilities
### Hub Utilities
```python
from huggingface_hub import login, snapshot_download
# Login
login(token="hf_...")
# Download model
snapshot_download(repo_id="model-name", cache_dir="./cache")
# Push to Hub
model.push_to_hub("username/model-name", commit_message="Initial commit")
tokenizer.push_to_hub("username/model-name")
```
### Evaluation Metrics
```python
import evaluate
# Load metric
metric = evaluate.load("accuracy")
# Compute metric
results = metric.compute(predictions=predictions, references=labels)
# Common metrics
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
```
## Model Outputs
All models return dataclass objects with named attributes:
```python
# Sequence classification output
outputs = model(**inputs)
logits = outputs.logits # [batch_size, num_labels]
loss = outputs.loss # If labels provided
# Causal LM output
outputs = model(**inputs)
logits = outputs.logits # [batch_size, seq_length, vocab_size]
past_key_values = outputs.past_key_values # KV cache
# Seq2Seq output
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
encoder_last_hidden_state = outputs.encoder_last_hidden_state
# Access as a tuple or a plain dict
outputs_tuple = outputs.to_tuple()
outputs_dict = dict(outputs)
```
## Best Practices
1. **Use Auto classes**: AutoModel, AutoTokenizer for flexibility
2. **Device management**: Use `device_map="auto"` for multi-GPU
3. **Memory optimization**: Use `torch_dtype=torch.float16` and quantization
4. **Caching**: Set `cache_dir` to avoid re-downloading
5. **Batch processing**: Process multiple inputs at once for efficiency
6. **Trust remote code**: Only set `trust_remote_code=True` for trusted sources


@@ -0,0 +1,467 @@
# Text Generation
## Overview
Generate text with language models using the `generate()` method. Control output quality and style through generation strategies and parameters.
## Basic Generation
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Tokenize input
inputs = tokenizer("Once upon a time", return_tensors="pt")
# Generate
outputs = model.generate(**inputs, max_new_tokens=50)
# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```
## Generation Strategies
### Greedy Decoding
Select highest probability token at each step (deterministic):
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=False # Greedy decoding (default)
)
```
**Use for**: Factual text, translations, where determinism is needed.
### Sampling
Randomly sample from probability distribution:
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True,
temperature=0.7,
top_k=50,
top_p=0.95
)
```
**Use for**: Creative writing, diverse outputs, open-ended generation.
### Beam Search
Explore multiple hypotheses in parallel:
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
early_stopping=True
)
```
**Use for**: Translations, summarization, where quality is critical.
### Contrastive Search
Balance quality and diversity:
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
penalty_alpha=0.6,
top_k=4
)
```
**Use for**: Long-form generation, reducing repetition.
## Key Parameters
### Length Control
**max_new_tokens**: Maximum tokens to generate
```python
max_new_tokens=100 # Generate up to 100 new tokens
```
**max_length**: Maximum total length (input + output)
```python
max_length=512 # Total sequence length
```
**min_new_tokens**: Minimum tokens to generate
```python
min_new_tokens=50 # Force at least 50 tokens
```
**min_length**: Minimum total length
```python
min_length=100
```
### Temperature
Controls randomness (only with sampling):
```python
temperature=1.0 # Default, balanced
temperature=0.7 # More focused, less random
temperature=1.5 # More creative, more random
```
Lower temperature → more deterministic
Higher temperature → more random
### Top-K Sampling
Consider only top K most likely tokens:
```python
do_sample=True
top_k=50 # Sample from top 50 tokens
```
**Common values**: 40-100 for balanced output, 10-20 for focused output.
### Top-P (Nucleus) Sampling
Consider tokens with cumulative probability ≥ P:
```python
do_sample=True
top_p=0.95 # Sample from smallest set with 95% cumulative probability
```
**Common values**: 0.9-0.95 for balanced, 0.7-0.85 for focused.
### Repetition Penalty
Discourage repetition:
```python
repetition_penalty=1.2 # Penalize repeated tokens
```
**Values**: 1.0 = no penalty, 1.2-1.5 = moderate, 2.0+ = strong penalty.
### Beam Search Parameters
**num_beams**: Number of beams
```python
num_beams=5 # Keep 5 hypotheses
```
**early_stopping**: Stop when num_beams sentences are finished
```python
early_stopping=True
```
**no_repeat_ngram_size**: Prevent n-gram repetition
```python
no_repeat_ngram_size=3 # Don't repeat any 3-gram
```
### Output Control
**num_return_sequences**: Generate multiple outputs
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
num_return_sequences=3 # Return 3 different sequences
)
```
**pad_token_id**: Specify padding token
```python
pad_token_id=tokenizer.eos_token_id
```
**eos_token_id**: Stop generation at specific token
```python
eos_token_id=tokenizer.eos_token_id
```
## Advanced Features
### Batch Generation
Generate for multiple prompts:
```python
prompts = ["Hello, my name is", "Once upon a time"]
# Decoder-only models like GPT-2 have no pad token; reuse EOS and pad on the left
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
for i, output in enumerate(outputs):
text = tokenizer.decode(output, skip_special_tokens=True)
print(f"Prompt {i}: {text}\n")
```
### Streaming Generation
Stream tokens as generated:
```python
from transformers import TextIteratorStreamer
from threading import Thread
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
generation_kwargs = dict(
inputs,
streamer=streamer,
max_new_tokens=100
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
print(text, end="", flush=True)
thread.join()
```
### Constrained Generation
Force specific token sequences:
```python
# Force generation to start with specific tokens
force_words = ["Paris", "France"]
force_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in force_words]
outputs = model.generate(
**inputs,
force_words_ids=force_words_ids,
num_beams=5
)
```
### Guidance and Control
**Prevent bad words:**
```python
bad_words = ["offensive", "inappropriate"]
bad_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in bad_words]
outputs = model.generate(
**inputs,
bad_words_ids=bad_words_ids
)
```
### Generation Config
Save and reuse generation parameters:
```python
from transformers import GenerationConfig
# Create config
generation_config = GenerationConfig(
max_new_tokens=100,
temperature=0.7,
top_k=50,
top_p=0.95,
do_sample=True
)
# Save
generation_config.save_pretrained("./my_generation_config")
# Load and use
generation_config = GenerationConfig.from_pretrained("./my_generation_config")
outputs = model.generate(**inputs, generation_config=generation_config)
```
## Model-Specific Generation
### Chat Models
Use chat templates:
```python
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```
### Encoder-Decoder Models
For T5, BART, etc.:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
# T5 uses task prefixes
input_text = "translate English to French: Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## Optimization
### Caching
Enable KV cache for faster generation:
```python
outputs = model.generate(
**inputs,
max_new_tokens=100,
use_cache=True # Default, faster generation
)
```
### Static Cache
For fixed sequence lengths:
```python
from transformers import StaticCache
cache = StaticCache(model.config, max_batch_size=1, max_cache_len=1024, device="cuda")
outputs = model.generate(
**inputs,
max_new_tokens=100,
past_key_values=cache
)
```
### Attention Implementation
Use Flash Attention for speed:
```python
model = AutoModelForCausalLM.from_pretrained(
"model-id",
attn_implementation="flash_attention_2"
)
```
## Generation Recipes
### Creative Writing
```python
outputs = model.generate(
**inputs,
max_new_tokens=200,
do_sample=True,
temperature=0.8,
top_k=50,
top_p=0.95,
repetition_penalty=1.2
)
```
### Factual Generation
```python
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=False, # Greedy
repetition_penalty=1.1
)
```
### Diverse Outputs
```python
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=5,
num_return_sequences=5,
temperature=1.5,
do_sample=True
)
```
### Long-Form Generation
```python
outputs = model.generate(
**inputs,
max_new_tokens=1000,
penalty_alpha=0.6, # Contrastive search
top_k=4,
repetition_penalty=1.2
)
```
### Translation/Summarization
```python
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=5,
early_stopping=True,
no_repeat_ngram_size=3
)
```
## Common Issues
**Repetitive output:**
- Increase repetition_penalty (1.2-1.5)
- Use no_repeat_ngram_size (2-3)
- Try contrastive search
- Lower temperature
**Poor quality:**
- Use beam search (num_beams=5)
- Lower temperature
- Adjust top_k/top_p
**Too deterministic:**
- Enable sampling (do_sample=True)
- Increase temperature (0.7-1.0)
- Adjust top_k/top_p
**Slow generation:**
- Reduce batch size
- Enable use_cache=True
- Use Flash Attention
- Reduce max_new_tokens
## Best Practices
1. **Start with defaults**: Then tune based on output
2. **Use appropriate strategy**: Greedy for factual, sampling for creative
3. **Set max_new_tokens**: Avoid unnecessarily long generation
4. **Enable caching**: For faster sequential generation
5. **Tune temperature**: Most impactful parameter for sampling
6. **Use beam search carefully**: Slower but higher quality
7. **Test different seeds**: For reproducibility with sampling
8. **Monitor memory**: Large beams use significant memory

View File

@@ -1,373 +0,0 @@
# Text Generation Strategies
Transformers provides flexible text generation capabilities through the `generate()` method, supporting multiple decoding strategies and configuration options.
## Basic Generation
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
generated_text = tokenizer.decode(outputs[0])
```
## Decoding Strategies
### 1. Greedy Decoding
Selects the token with highest probability at each step. Deterministic but can be repetitive.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=False,
num_beams=1 # Greedy is default when num_beams=1 and do_sample=False
)
```
### 2. Beam Search
Explores multiple hypotheses simultaneously, keeping top-k candidates at each step.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5, # Number of beams
early_stopping=True, # Stop when all beams reach EOS
no_repeat_ngram_size=2, # Prevent repeating n-grams
)
```
**Key parameters:**
- `num_beams`: Number of beams (higher = more thorough but slower)
- `early_stopping`: Stop when all beams finish (True/False)
- `length_penalty`: Exponential penalty for length (>1.0 favors longer sequences)
- `no_repeat_ngram_size`: Prevent repeating n-grams
### 3. Sampling (Multinomial)
Samples from probability distribution, introducing randomness and diversity.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True,
temperature=0.7, # Controls randomness (lower = more focused)
top_k=50, # Consider only top-k tokens
top_p=0.9, # Nucleus sampling (cumulative probability threshold)
)
```
**Key parameters:**
- `temperature`: Scales logits before softmax (0.1-2.0 typical range)
- Lower (0.1-0.7): More focused, deterministic
- Higher (0.8-1.5): More creative, random
- `top_k`: Sample from top-k tokens only
- `top_p`: Nucleus sampling - sample from smallest set with cumulative probability > p
### 4. Beam Search with Sampling
Combines beam search with sampling for diverse but coherent outputs.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
do_sample=True,
temperature=0.8,
top_k=50,
)
```
### 5. Contrastive Search
Balances coherence and diversity using contrastive objective.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
penalty_alpha=0.6, # Contrastive penalty
top_k=4, # Consider top-k candidates
)
```
### 6. Assisted Decoding
Uses a smaller "assistant" model to speed up generation of larger model.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2-large")
assistant_model = AutoModelForCausalLM.from_pretrained("gpt2")
outputs = model.generate(
**inputs,
assistant_model=assistant_model,
max_new_tokens=50,
)
```
## GenerationConfig
Configure generation parameters with `GenerationConfig` for reusability.
```python
from transformers import GenerationConfig
generation_config = GenerationConfig(
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
)
# Use with model
outputs = model.generate(**inputs, generation_config=generation_config)
# Save and load
generation_config.save_pretrained("./config")
loaded_config = GenerationConfig.from_pretrained("./config")
```
## Key Parameters Reference
### Output Length Control
- `max_length`: Maximum total tokens (input + output)
- `max_new_tokens`: Maximum new tokens to generate (recommended over max_length)
- `min_length`: Minimum total tokens
- `min_new_tokens`: Minimum new tokens to generate
### Sampling Parameters
- `temperature`: Sampling temperature (0.1-2.0, default 1.0)
- `top_k`: Top-k sampling (1-100, typically 50)
- `top_p`: Nucleus sampling (0.0-1.0, typically 0.9)
- `do_sample`: Enable sampling (True/False)
### Beam Search Parameters
- `num_beams`: Number of beams (1-20, typically 5)
- `early_stopping`: Stop when beams finish (True/False)
- `length_penalty`: Length penalty (>1.0 favors longer, <1.0 favors shorter)
- `num_beam_groups`: Diverse beam search groups (see the sketch after this reference)
- `diversity_penalty`: Penalty for similar beams
### Repetition Control
- `repetition_penalty`: Penalty for repeating tokens (1.0-2.0, default 1.0)
- `no_repeat_ngram_size`: Prevent repeating n-grams (2-5 typical)
- `encoder_repetition_penalty`: Penalty for repeating encoder tokens
### Special Tokens
- `bos_token_id`: Beginning of sequence token
- `eos_token_id`: End of sequence token (or list of tokens)
- `pad_token_id`: Padding token
- `forced_bos_token_id`: Force specific token at beginning
- `forced_eos_token_id`: Force specific token at end
### Multiple Sequences
- `num_return_sequences`: Number of sequences to return
- `num_beam_groups`: Number of diverse beam groups
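A sketch of diverse beam search, which exercises `num_beam_groups` and `diversity_penalty` from the lists above (assumes the `model`, `tokenizer`, and `inputs` from the basic example):
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=6,
    num_beam_groups=3,      # beams are split into 3 groups kept diverse from each other
    diversity_penalty=1.0,  # penalty applied when groups pick the same tokens
    num_return_sequences=3,
)
for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))
```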
## Advanced Generation Techniques
### Constrained Generation
Force generation to include specific tokens or follow patterns.
```python
from transformers import PhrasalConstraint
constraints = [
PhrasalConstraint(tokenizer("New York", add_special_tokens=False).input_ids)
]
outputs = model.generate(
**inputs,
constraints=constraints,
num_beams=5,
)
```
### Streaming Generation
Generate tokens one at a time for real-time display.
```python
from transformers import TextIteratorStreamer
from threading import Thread
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
generation_kwargs = dict(
**inputs,
max_new_tokens=100,
streamer=streamer,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for new_text in streamer:
print(new_text, end="", flush=True)
thread.join()
```
### Logit Processors
Customize token selection with custom logit processors.
```python
from transformers import LogitsProcessor, LogitsProcessorList
class CustomLogitsProcessor(LogitsProcessor):
def __call__(self, input_ids, scores):
# Modify scores here
return scores
logits_processor = LogitsProcessorList([CustomLogitsProcessor()])
outputs = model.generate(
**inputs,
logits_processor=logits_processor,
)
```
### Stopping Criteria
Define custom stopping conditions.
```python
from transformers import StoppingCriteria, StoppingCriteriaList
class CustomStoppingCriteria(StoppingCriteria):
def __call__(self, input_ids, scores, **kwargs):
# Return True to stop generation
return False
stopping_criteria = StoppingCriteriaList([CustomStoppingCriteria()])
outputs = model.generate(
**inputs,
stopping_criteria=stopping_criteria,
)
```
## Best Practices
### For Creative Tasks (Stories, Dialogue)
```python
outputs = model.generate(
**inputs,
max_new_tokens=200,
do_sample=True,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
)
```
### For Factual Tasks (Summaries, QA)
```python
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=4,
early_stopping=True,
no_repeat_ngram_size=2,
length_penalty=1.0,
)
```
### For Chat/Instruction Following
```python
outputs = model.generate(
**inputs,
max_new_tokens=512,
do_sample=True,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.1,
)
```
## Vision-Language Model Generation
For models like LLaVA, BLIP-2, etc.:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open("image.jpg")
inputs = processor(text="Describe this image", images=image, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=True,
temperature=0.7,
)
generated_text = processor.decode(outputs[0], skip_special_tokens=True)
```
## Performance Optimization
### Use KV Cache
```python
# KV cache is enabled by default
outputs = model.generate(**inputs, use_cache=True)
```
### Mixed Precision
```python
import torch
with torch.cuda.amp.autocast():
outputs = model.generate(**inputs, max_new_tokens=100)
```
### Batch Generation
```python
texts = ["Prompt 1", "Prompt 2", "Prompt 3"]
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # ensure a pad token exists for batching
inputs = tokenizer(texts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=50)
```
### Quantization
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
```


@@ -0,0 +1,361 @@
# Model Loading and Management
## Overview
The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.
## Loading Models
### AutoModel Classes
Use AutoModel classes for automatic architecture selection:
```python
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM
# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")
# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
### Common AutoModel Classes
**NLP Tasks:**
- `AutoModelForSequenceClassification`: Text classification, sentiment analysis
- `AutoModelForTokenClassification`: NER, POS tagging
- `AutoModelForQuestionAnswering`: Extractive QA
- `AutoModelForCausalLM`: Text generation (GPT, Llama)
- `AutoModelForMaskedLM`: Masked language modeling (BERT)
- `AutoModelForSeq2SeqLM`: Translation, summarization (T5, BART)
**Vision Tasks:**
- `AutoModelForImageClassification`: Image classification
- `AutoModelForObjectDetection`: Object detection
- `AutoModelForImageSegmentation`: Image segmentation
**Audio Tasks:**
- `AutoModelForAudioClassification`: Audio classification
- `AutoModelForSpeechSeq2Seq`: Speech recognition (see the sketch after this list)
**Multimodal:**
- `AutoModelForVision2Seq`: Image captioning, VQA
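The audio and multimodal classes follow the same `from_pretrained` pattern; a minimal sketch using Whisper:
```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-base")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-base")
```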
## Loading Parameters
### Basic Parameters
**pretrained_model_name_or_path**: Model identifier or local path
```python
model = AutoModel.from_pretrained("bert-base-uncased") # From Hub
model = AutoModel.from_pretrained("./local/model/path") # From disk
```
**num_labels**: Number of output labels for classification
```python
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=3
)
```
**cache_dir**: Custom cache location
```python
model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")
```
### Device Management
**device_map**: Automatic device allocation for large models
```python
# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
device_map="auto"
)
# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
"model-id",
device_map="sequential"
)
# Custom device map
device_map = {
"transformer.layers.0": 0, # GPU 0
"transformer.layers.1": 1, # GPU 1
"transformer.layers.2": "cpu", # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)
```
Manual device placement:
```python
import torch
model = AutoModel.from_pretrained("model-id")
model.to("cuda:0") # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
```
### Precision Control
**torch_dtype**: Set model precision
```python
import torch
# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# BFloat16 (better range than float16)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.bfloat16)
# Auto (use original dtype)
model = AutoModel.from_pretrained("model-id", torch_dtype="auto")
```
### Attention Implementation
**attn_implementation**: Choose attention mechanism
```python
# Scaled Dot Product Attention (PyTorch 2.0+, fastest)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")
# Flash Attention 2 (requires flash-attn package)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")
# Eager (default, most compatible)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")
```
### Memory Optimization
**low_cpu_mem_usage**: Reduce CPU memory during loading
```python
model = AutoModelForCausalLM.from_pretrained(
"large-model-id",
low_cpu_mem_usage=True,
device_map="auto"
)
```
**load_in_8bit**: 8-bit quantization (requires bitsandbytes)
```python
model = AutoModelForCausalLM.from_pretrained(
"model-id",
load_in_8bit=True,
device_map="auto"
)
```
**load_in_4bit**: 4-bit quantization
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
"model-id",
quantization_config=quantization_config,
device_map="auto"
)
```
## Model Configuration
### Loading with Custom Config
```python
from transformers import AutoConfig, AutoModel
# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2
# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
```
### Initializing from Config Only
```python
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config) # Random weights
```
## Model Modes
### Training vs Evaluation Mode
Models load in evaluation mode by default:
```python
model = AutoModel.from_pretrained("model-id")
print(model.training) # False
# Switch to training mode
model.train()
# Switch back to evaluation mode
model.eval()
```
Evaluation mode disables dropout and uses batch norm statistics.
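A minimal inference sketch pairing evaluation mode with disabled gradient tracking (assumes `model` and `inputs` are already defined):
```python
import torch

model.eval()  # disable dropout
with torch.inference_mode():  # skip gradient tracking for faster inference
    outputs = model(**inputs)
```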
## Saving Models
### Save Locally
```python
model.save_pretrained("./my_model")
```
This creates:
- `config.json`: Model configuration
- `pytorch_model.bin` or `model.safetensors`: Model weights
### Save to Hugging Face Hub
```python
model.push_to_hub("username/model-name")
# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")
# Private repository
model.push_to_hub("username/model-name", private=True)
```
## Model Inspection
### Parameter Count
```python
# Total parameters
total_params = model.num_parameters()
# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)
print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")
```
### Memory Footprint
```python
memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")
```
### Model Architecture
```python
print(model) # Print full architecture
# Access specific components
print(model.config)
print(model.base_model)
```
## Forward Pass
Basic inference:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")
inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)
```
## Model Formats
### SafeTensors vs PyTorch
SafeTensors is faster and safer:
```python
# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)
# Load either format automatically
model = AutoModel.from_pretrained("./model")
```
### ONNX Export
Export for optimized inference:
```python
from pathlib import Path
from transformers.onnx import FeaturesManager, export

# Resolve an ONNX config for the model, then export (legacy transformers.onnx API);
# feature="sequence-classification" is an assumption matching the model above
model_kind, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(model, feature="sequence-classification")
onnx_config = onnx_config_cls(model.config)
export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("model.onnx"),
)
```
## Best Practices
1. **Use AutoModel classes**: Automatic architecture detection
2. **Specify dtype explicitly**: Control precision and memory
3. **Use device_map="auto"**: For large models
4. **Enable low_cpu_mem_usage**: When loading large models
5. **Use safetensors format**: Faster and safer serialization
6. **Check model.training**: Ensure correct mode for task
7. **Consider quantization**: For deployment on resource-constrained devices
8. **Cache models locally**: Set TRANSFORMERS_CACHE environment variable
## Common Issues
**CUDA out of memory:**
```python
# Use smaller precision
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# Or use quantization
model = AutoModel.from_pretrained("model-id", load_in_8bit=True)
# Or use CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")
```
**Slow loading:**
```python
# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)
```
**Model not found:**
```python
# Verify the model ID exists on huggingface.co
# Check authentication for private models
from huggingface_hub import login
login()
```


@@ -1,234 +1,335 @@
# Transformers Pipelines # Pipeline API Reference
Pipelines provide a simple and optimized interface for inference across many machine learning tasks. They abstract away the complexity of tokenization, model invocation, and post-processing. ## Overview
## Usage Pattern Pipelines provide the simplest way to use pre-trained models for inference. They abstract away tokenization, model loading, and post-processing, offering a unified interface for dozens of tasks.
## Basic Usage
Create a pipeline by specifying a task:
```python ```python
from transformers import pipeline from transformers import pipeline
# Basic usage # Auto-select default model for task
classifier = pipeline("text-classification") pipe = pipeline("text-classification")
result = classifier("This movie was amazing!") result = pipe("This is great!")
# With specific model
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("This movie was amazing!")
``` ```
## Natural Language Processing Pipelines Or specify a model:
### Text Classification ```python
pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
```
## Supported Tasks
### Natural Language Processing
**text-generation**: Generate text continuations
```python
generator = pipeline("text-generation", model="gpt2")
output = generator("Once upon a time", max_length=50, num_return_sequences=2)
```
**text-classification**: Classify text into categories
```python ```python
classifier = pipeline("text-classification") classifier = pipeline("text-classification")
classifier("I love this product!") result = classifier("I love this product!") # Returns label and score
# [{'label': 'POSITIVE', 'score': 0.9998}]
``` ```
### Zero-Shot Classification **token-classification**: Label individual tokens (NER, POS tagging)
```python ```python
classifier = pipeline("zero-shot-classification") ner = pipeline("token-classification", model="dslim/bert-base-NER")
classifier("This is about climate change", candidate_labels=["politics", "science", "sports"]) entities = ner("Hugging Face is based in New York City")
``` ```
### Token Classification (NER) **question-answering**: Extract answers from context
```python
ner = pipeline("token-classification")
ner("My name is Sarah and I work at Microsoft in Seattle")
```
### Question Answering
```python ```python
qa = pipeline("question-answering") qa = pipeline("question-answering")
qa(question="What is the capital?", context="The capital of France is Paris.") result = qa(question="What is the capital?", context="Paris is the capital of France.")
``` ```
### Text Generation **fill-mask**: Predict masked tokens
```python ```python
generator = pipeline("text-generation") unmasker = pipeline("fill-mask", model="bert-base-uncased")
generator("Once upon a time", max_length=50) result = unmasker("Paris is the [MASK] of France")
``` ```
### Text2Text Generation **summarization**: Summarize long texts
```python ```python
generator = pipeline("text2text-generation", model="t5-base") summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
generator("translate English to French: Hello") summary = summarizer("Long article text...", max_length=130, min_length=30)
``` ```
### Summarization **translation**: Translate between languages
```python ```python
summarizer = pipeline("summarization") translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
summarizer("Long article text here...", max_length=130, min_length=30) result = translator("Hello, how are you?")
``` ```
### Translation **zero-shot-classification**: Classify without training data
```python ```python
translator = pipeline("translation_en_to_fr") classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
translator("Hello, how are you?") result = classifier(
"This is a course about Python programming",
candidate_labels=["education", "politics", "business"]
)
``` ```
### Fill Mask **sentiment-analysis**: Alias for text-classification focused on sentiment
```python ```python
unmasker = pipeline("fill-mask") sentiment = pipeline("sentiment-analysis")
unmasker("Paris is the [MASK] of France.") result = sentiment("This product exceeded my expectations!")
``` ```
### Feature Extraction ### Computer Vision
**image-classification**: Classify images
```python ```python
extractor = pipeline("feature-extraction") classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
embeddings = extractor("This is a sentence") result = classifier("path/to/image.jpg")
# Or use PIL Image or URL
from PIL import Image
result = classifier(Image.open("image.jpg"))
``` ```
### Document Question Answering **object-detection**: Detect objects in images
```python ```python
doc_qa = pipeline("document-question-answering") detector = pipeline("object-detection", model="facebook/detr-resnet-50")
doc_qa(image="document.png", question="What is the invoice number?") results = detector("image.jpg") # Returns bounding boxes and labels
``` ```
### Table Question Answering **image-segmentation**: Segment images
```python ```python
table_qa = pipeline("table-question-answering") segmenter = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic")
table_qa(table=data, query="How many employees?") segments = segmenter("image.jpg")
``` ```
## Computer Vision Pipelines **depth-estimation**: Estimate depth from images
### Image Classification
```python ```python
classifier = pipeline("image-classification") depth = pipeline("depth-estimation", model="Intel/dpt-large")
classifier("cat.jpg") result = depth("image.jpg")
``` ```
### Zero-Shot Image Classification **zero-shot-image-classification**: Classify images without training
```python ```python
classifier = pipeline("zero-shot-image-classification") classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
classifier("cat.jpg", candidate_labels=["cat", "dog", "bird"]) result = classifier("image.jpg", candidate_labels=["cat", "dog", "bird"])
``` ```
### Object Detection ### Audio
**automatic-speech-recognition**: Transcribe speech
```python ```python
detector = pipeline("object-detection") asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
detector("street.jpg") text = asr("audio.mp3")
``` ```
### Image Segmentation **audio-classification**: Classify audio
```python ```python
segmenter = pipeline("image-segmentation") classifier = pipeline("audio-classification", model="MIT/ast-finetuned-audioset-10-10-0.4593")
segmenter("image.jpg") result = classifier("audio.wav")
``` ```
### Image-to-Image **text-to-speech**: Generate speech from text (with specific models)
```python ```python
img2img = pipeline("image-to-image", model="lllyasviel/sd-controlnet-canny") tts = pipeline("text-to-speech", model="microsoft/speecht5_tts")
img2img("input.jpg") audio = tts("Hello, this is a test")
``` ```
### Depth Estimation ### Multimodal
**visual-question-answering**: Answer questions about images
```python ```python
depth = pipeline("depth-estimation") vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
depth("image.jpg") result = vqa(image="image.jpg", question="What color is the car?")
``` ```
### Video Classification **document-question-answering**: Answer questions about documents
```python ```python
classifier = pipeline("video-classification") doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
classifier("video.mp4") result = doc_qa(image="document.png", question="What is the invoice number?")
``` ```
### Keypoint Matching **image-to-text**: Generate captions for images
```python ```python
matcher = pipeline("keypoint-matching") captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
matcher(image1="img1.jpg", image2="img2.jpg") caption = captioner("image.jpg")
``` ```
## Audio Pipelines ## Pipeline Parameters
### Automatic Speech Recognition
```python
asr = pipeline("automatic-speech-recognition")
asr("audio.wav")
```
### Audio Classification
```python
classifier = pipeline("audio-classification")
classifier("audio.wav")
```
### Zero-Shot Audio Classification
```python
classifier = pipeline("zero-shot-audio-classification")
classifier("audio.wav", candidate_labels=["speech", "music", "noise"])
```
### Text-to-Audio/Text-to-Speech
```python
synthesizer = pipeline("text-to-audio")
audio = synthesizer("Hello, how are you today?")
```
## Multimodal Pipelines
### Image-to-Text (Image Captioning)
```python
captioner = pipeline("image-to-text")
captioner("image.jpg")
```
### Visual Question Answering
```python
vqa = pipeline("visual-question-answering")
vqa(image="image.jpg", question="What color is the car?")
```
### Image-Text-to-Text (VLMs)
```python
vlm = pipeline("image-text-to-text")
vlm(images="image.jpg", text="Describe this image in detail")
```
### Zero-Shot Object Detection
```python
detector = pipeline("zero-shot-object-detection")
detector("image.jpg", candidate_labels=["car", "person", "tree"])
```
## Pipeline Configuration
### Common Parameters ### Common Parameters
- `model`: Specify model identifier or path **model**: Model identifier or path
- `device`: Set device (0 for GPU, -1 for CPU, or "cuda:0")
- `batch_size`: Process multiple inputs at once
- `torch_dtype`: Set precision (torch.float16, torch.bfloat16)
```python ```python
# GPU with half precision pipe = pipeline("task", model="model-id")
pipe = pipeline("text-generation", model="gpt2", device=0, torch_dtype=torch.float16)
# Batch processing
pipe(["text 1", "text 2", "text 3"], batch_size=8)
``` ```
### Task-Specific Parameters **device**: GPU device index (-1 for CPU, 0+ for GPU)
```python
pipe = pipeline("task", device=0) # Use first GPU
```
Each pipeline accepts task-specific parameters in the call: **device_map**: Automatic device allocation for large models
```python
pipe = pipeline("task", model="large-model", device_map="auto")
```
**dtype**: Model precision (reduces memory)
```python
import torch
pipe = pipeline("task", torch_dtype=torch.float16)
```
**batch_size**: Process multiple inputs at once
```python
pipe = pipeline("task", batch_size=8)
results = pipe(["text1", "text2", "text3"])
```
**framework**: Choose PyTorch or TensorFlow
```python
pipe = pipeline("task", framework="pt") # or "tf"
```
## Batch Processing
Process multiple inputs efficiently:
```python ```python
# Text generation classifier = pipeline("text-classification")
generator("prompt", max_length=100, temperature=0.7, top_p=0.9, num_return_sequences=3) texts = ["Great product!", "Terrible experience", "Just okay"]
results = classifier(texts)
```
# Summarization For large datasets, use generators or KeyDataset:
summarizer("text", max_length=130, min_length=30, do_sample=False)
# Translation ```python
translator("text", max_length=512, num_beams=4) from transformers.pipelines.pt_utils import KeyDataset
import datasets
dataset = datasets.load_dataset("dataset-name", split="test")
pipe = pipeline("task", device=0)
for output in pipe(KeyDataset(dataset, "text")):
print(output)
```
## Performance Optimization
### GPU Acceleration
Always specify device for GPU usage:
```python
pipe = pipeline("task", device=0)
```
### Mixed Precision
Use float16 for 2x speedup on supported GPUs:
```python
import torch
pipe = pipeline("task", torch_dtype=torch.float16, device=0)
```
### Batching Guidelines
- **CPU**: Usually skip batching
- **GPU with variable lengths**: May reduce efficiency
- **GPU with similar lengths**: Significant speedup
- **Real-time applications**: Skip batching (increases latency)
```python
# Good for throughput
pipe = pipeline("task", batch_size=32, device=0)
results = pipe(list_of_texts)
```
### Streaming Output
For text generation, stream tokens as they're generated:
```python
from transformers import AutoTokenizer, TextStreamer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
generator = pipeline("text-generation", model="gpt2", tokenizer=tokenizer, streamer=TextStreamer(tokenizer))
generator("The future of AI", max_length=100)
```
## Custom Pipeline Configuration
Specify tokenizer and model separately:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
```
Use custom pipeline classes:
```python
from transformers import TextClassificationPipeline
class CustomPipeline(TextClassificationPipeline):
def postprocess(self, model_outputs, **kwargs):
# Custom post-processing
return super().postprocess(model_outputs, **kwargs)
pipe = pipeline("text-classification", model="model-id", pipeline_class=CustomPipeline)
```
## Input Formats
Pipelines accept various input types:
**Text tasks**: Strings or lists of strings
```python
pipe("single text")
pipe(["text1", "text2"])
```
**Image tasks**: URLs, file paths, PIL Images, or numpy arrays
```python
pipe("https://example.com/image.jpg")
pipe("local/path/image.png")
pipe(PIL.Image.open("image.jpg"))
pipe(numpy_array)
```
**Audio tasks**: File paths, numpy arrays, or raw waveforms
```python
pipe("audio.mp3")
pipe(audio_array)
```
## Error Handling
Handle common issues:
```python
try:
result = pipe(input_data)
except Exception as e:
if "CUDA out of memory" in str(e):
# Reduce batch size or use CPU
pipe = pipeline("task", device=-1)
elif "does not appear to have a file named" in str(e):
# Model not found
print("Check model identifier")
else:
raise
``` ```
## Best Practices ## Best Practices
1. **Reuse pipelines**: Create once, use multiple times for efficiency 1. **Use pipelines for prototyping**: Fast iteration without boilerplate
2. **Batch processing**: Use batches for multiple inputs to maximize throughput 2. **Specify models explicitly**: Default models may change
3. **GPU acceleration**: Set `device=0` for GPU when available 3. **Enable GPU when available**: Significant speedup
4. **Model selection**: Choose task-specific models for best results 4. **Use batching for throughput**: When processing many inputs
5. **Memory management**: Use `torch_dtype=torch.float16` for large models 5. **Consider memory usage**: Use float16 or smaller models for large batches
6. **Cache models locally**: Avoid repeated downloads


@@ -1,599 +0,0 @@
# Common Task Patterns
This document provides common patterns and workflows for typical tasks using Transformers.
## Text Classification
### Binary or Multi-class Classification
```python
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments,
Trainer
)
from datasets import load_dataset
import evaluate
import numpy as np
# Load dataset
dataset = load_dataset("imdb")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Load model
id2label = {0: "negative", 1: "positive"}
label2id = {"negative": 0, "positive": 1}
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2,
id2label=id2label,
label2id=label2id
)
# Metrics
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
num_train_epochs=3,
weight_decay=0.01,
load_best_model_at_end=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
compute_metrics=compute_metrics,
)
trainer.train()
# Inference
text = "This movie was fantastic!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits.argmax(-1)
print(id2label[predictions.item()])
```
## Named Entity Recognition (Token Classification)
```python
from transformers import (
AutoTokenizer,
AutoModelForTokenClassification,
TrainingArguments,
Trainer,
DataCollatorForTokenClassification
)
from datasets import load_dataset
import evaluate
import numpy as np
# Load dataset
dataset = load_dataset("conll2003")
# Tokenize (align labels with tokenized words)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_and_align_labels(examples):
tokenized_inputs = tokenizer(
examples["tokens"],
truncation=True,
is_split_into_words=True
)
labels = []
for i, label in enumerate(examples["ner_tags"]):
word_ids = tokenized_inputs.word_ids(batch_index=i)
label_ids = []
previous_word_idx = None
for word_idx in word_ids:
if word_idx is None:
label_ids.append(-100)
elif word_idx != previous_word_idx:
label_ids.append(label[word_idx])
else:
label_ids.append(-100)
previous_word_idx = word_idx
labels.append(label_ids)
tokenized_inputs["labels"] = labels
return tokenized_inputs
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
# Model
label_list = dataset["train"].features["ner_tags"].feature.names
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=len(label_list)
)
# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer)
# Metrics
metric = evaluate.load("seqeval")
def compute_metrics(eval_pred):
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=2)
true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
true_predictions = [
[label_list[p] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
return metric.compute(predictions=true_predictions, references=true_labels)
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
compute_metrics=compute_metrics,
)
trainer.train()
```
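After training, a quick way to run inference is the token-classification pipeline with entity grouping. A minimal sketch, assuming the checkpoint was saved to a hypothetical `./ner_model` directory (pass `id2label`/`label2id` when loading the model if you want readable entity names):
```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="./ner_model",            # Hypothetical directory created with trainer.save_model()
    tokenizer="./ner_model",
    aggregation_strategy="simple",  # Merge word pieces into whole entities
)
print(ner("Sarah works at Microsoft in Seattle"))
```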
## Question Answering
```python
from transformers import (
AutoTokenizer,
AutoModelForQuestionAnswering,
TrainingArguments,
Trainer,
DefaultDataCollator
)
from datasets import load_dataset
# Load dataset
dataset = load_dataset("squad")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def preprocess_function(examples):
questions = [q.strip() for q in examples["question"]]
inputs = tokenizer(
questions,
examples["context"],
max_length=384,
truncation="only_second",
return_offsets_mapping=True,
padding="max_length",
)
offset_mapping = inputs.pop("offset_mapping")
answers = examples["answers"]
start_positions = []
end_positions = []
for i, offset in enumerate(offset_mapping):
answer = answers[i]
start_char = answer["answer_start"][0]
end_char = start_char + len(answer["text"][0])
# Find start and end token positions
sequence_ids = inputs.sequence_ids(i)
context_start = sequence_ids.index(1)
context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
start_positions.append(0)
end_positions.append(0)
else:
idx = context_start
while idx <= context_end and offset[idx][0] <= start_char:
idx += 1
start_positions.append(idx - 1)
idx = context_end
while idx >= context_start and offset[idx][1] >= end_char:
idx -= 1
end_positions.append(idx + 1)
inputs["start_positions"] = start_positions
inputs["end_positions"] = end_positions
return inputs
tokenized_datasets = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
# Model
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=DefaultDataCollator(),
)
trainer.train()
# Inference
question = "What is the capital of France?"
context = "Paris is the capital and most populous city of France."
inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
start_pos = outputs.start_logits.argmax()
end_pos = outputs.end_logits.argmax()
answer_tokens = inputs.input_ids[0][start_pos:end_pos+1]
answer = tokenizer.decode(answer_tokens)
```
## Text Summarization
```python
from transformers import (
AutoTokenizer,
AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
DataCollatorForSeq2Seq
)
from datasets import load_dataset
import evaluate
import numpy as np
# Load dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("t5-small")
def preprocess_function(examples):
inputs = ["summarize: " + doc for doc in examples["article"]]
model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
labels = tokenizer(
text_target=examples["highlights"],
max_length=128,
truncation=True
)
model_inputs["labels"] = labels["input_ids"]
return model_inputs
tokenized_datasets = dataset.map(preprocess_function, batched=True)
# Model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
# Metrics
rouge = evaluate.load("rouge")
def compute_metrics(eval_pred):
predictions, labels = eval_pred
decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
result = rouge.compute(
predictions=decoded_preds,
references=decoded_labels,
use_stemmer=True
)
return {k: round(v, 4) for k, v in result.items()}
# Train
training_args = Seq2SeqTrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
num_train_epochs=3,
predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
compute_metrics=compute_metrics,
)
trainer.train()
# Inference
text = "Long article text..."
inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## Translation
```python
from transformers import (
AutoTokenizer,
AutoModelForSeq2SeqLM,
TrainingArguments,
Trainer,
DataCollatorForSeq2Seq
)
from datasets import load_dataset
# Load dataset
dataset = load_dataset("wmt16", "de-en")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("t5-small")
def preprocess_function(examples):
inputs = [f"translate German to English: {de}" for de in examples["de"]]
model_inputs = tokenizer(inputs, max_length=128, truncation=True)
labels = tokenizer(
        text_target=[ex["en"] for ex in examples["translation"]],
max_length=128,
truncation=True
)
model_inputs["labels"] = labels["input_ids"]
return model_inputs
tokenized_datasets = dataset.map(preprocess_function, batched=True)
# Model and training (similar to summarization)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
# Inference
text = "Guten Tag, wie geht es Ihnen?"
inputs = tokenizer(f"translate German to English: {text}", return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## Causal Language Modeling (Training from Scratch or Fine-tuning)
```python
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling
)
from datasets import load_dataset
# Load dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
# Tokenize
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
return tokenizer(examples["text"], truncation=True, max_length=512)
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
# Group texts into chunks
block_size = 128
def group_texts(examples):
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
total_length = (total_length // block_size) * block_size
result = {
k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
result["labels"] = result["input_ids"].copy()
return result
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
# Model
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=8,
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=lm_datasets["train"],
eval_dataset=lm_datasets["validation"],
data_collator=data_collator,
)
trainer.train()
```
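Once fine-tuned, the causal language model can be sampled directly. A minimal sketch reusing the `model` and `tokenizer` objects from the block above:
```python
prompt = "The history of machine learning"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```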
## Image Classification
```python
from transformers import (
AutoImageProcessor,
AutoModelForImageClassification,
TrainingArguments,
Trainer
)
from datasets import load_dataset
from torchvision.transforms import Compose, Resize, ToTensor, Normalize
import numpy as np
import torch
import evaluate
# Load dataset
dataset = load_dataset("food101", split="train[:5000]")
# Prepare image transforms
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
size = image_processor.size["height"]
transforms = Compose([
Resize((size, size)),
ToTensor(),
normalize,
])
def preprocess_function(examples):
examples["pixel_values"] = [transforms(img.convert("RGB")) for img in examples["image"]]
return examples
dataset = dataset.with_transform(preprocess_function)
# Model
model = AutoModelForImageClassification.from_pretrained(
"google/vit-base-patch16-224",
num_labels=len(dataset["train"].features["label"].names),
ignore_mismatched_sizes=True
)
# Metrics
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
predictions = np.argmax(eval_pred.predictions, axis=1)
return metric.compute(predictions=predictions, references=eval_pred.label_ids)
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=5e-5,
per_device_train_batch_size=16,
    num_train_epochs=3,
    remove_unused_columns=False,  # Keep the image column so the transform can build pixel_values
)
def collate_fn(examples):
    pixel_values = torch.stack([ex["pixel_values"] for ex in examples])
    labels = torch.tensor([ex["label"] for ex in examples])
    return {"pixel_values": pixel_values, "labels": labels}
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
)
trainer.train()
```
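To classify a single image with the fine-tuned model, reuse `image_processor` and map the predicted index back to a class name. A minimal sketch, where `example.jpg` is a placeholder path:
```python
from PIL import Image
import torch

image = Image.open("example.jpg").convert("RGB")  # Placeholder path
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_id = logits.argmax(-1).item()
print(dataset["train"].features["label"].int2str(predicted_id))
```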
## Vision-Language Tasks (Image Captioning)
```python
from transformers import (
AutoProcessor,
AutoModelForVision2Seq,
TrainingArguments,
Trainer
)
from datasets import load_dataset
from PIL import Image
# Load dataset
dataset = load_dataset("ybelkada/football-dataset")
# Processor
processor = AutoProcessor.from_pretrained("microsoft/git-base")
def preprocess_function(examples):
images = [Image.open(img).convert("RGB") for img in examples["image"]]
texts = examples["caption"]
inputs = processor(images=images, text=texts, padding="max_length", truncation=True)
inputs["labels"] = inputs["input_ids"]
return inputs
dataset = dataset.map(preprocess_function, batched=True)
# Model
model = AutoModelForVision2Seq.from_pretrained("microsoft/git-base")
# Train
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=5e-5,
per_device_train_batch_size=8,
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
)
trainer.train()
# Inference
image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)
caption = processor.decode(outputs[0], skip_special_tokens=True)
```
## Best Practices Summary
1. **Use appropriate Auto* classes**: AutoTokenizer, AutoModel, etc. for model loading
2. **Proper preprocessing**: Tokenize, align labels, handle special cases
3. **Data collators**: Use appropriate collators for dynamic padding
4. **Metrics**: Load and compute relevant metrics for evaluation
5. **Training arguments**: Configure properly for task and hardware
6. **Inference**: Use pipeline() for quick inference, or manual tokenization for custom needs (see the example below)
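For point 6, a fine-tuned checkpoint saved with `trainer.save_model()` can be loaded straight into a pipeline. A minimal sketch, assuming a hypothetical `./results/final_model` save directory:
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="./results/final_model",      # Hypothetical save directory
    tokenizer="./results/final_model",
)
print(classifier("This movie was fantastic!"))
```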


@@ -0,0 +1,447 @@
# Tokenizers
## Overview
Tokenizers convert text into numerical representations (tokens) that models can process. They handle special tokens, padding, truncation, and attention masks.
## Loading Tokenizers
### AutoTokenizer
Automatically load the correct tokenizer for a model:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```
Load from local path:
```python
tokenizer = AutoTokenizer.from_pretrained("./local/tokenizer/path")
```
## Basic Tokenization
### Encode Text
```python
# Simple encoding
text = "Hello, how are you?"
tokens = tokenizer.encode(text)
print(tokens) # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
# With text tokenization
tokens = tokenizer.tokenize(text)
print(tokens) # ['hello', ',', 'how', 'are', 'you', '?']
```
### Decode Tokens
```python
token_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
text = tokenizer.decode(token_ids)
print(text) # "[CLS] hello, how are you? [SEP]"
# Skip special tokens
text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(text) # "hello, how are you?"
```
## The `__call__` Method
Primary tokenization interface:
```python
# Single text
inputs = tokenizer("Hello, how are you?")
# Returns dictionary with input_ids, attention_mask
print(inputs)
# {
# 'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
# 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]
# }
```
Multiple texts:
```python
texts = ["Hello", "How are you?"]
inputs = tokenizer(texts, padding=True, truncation=True)
```
## Key Parameters
### Return Tensors
**return_tensors**: Output format ("pt", "tf", "np")
```python
# PyTorch tensors
inputs = tokenizer("text", return_tensors="pt")
# TensorFlow tensors
inputs = tokenizer("text", return_tensors="tf")
# NumPy arrays
inputs = tokenizer("text", return_tensors="np")
```
### Padding
**padding**: Pad sequences to same length
```python
# Pad to longest sequence in batch
inputs = tokenizer(texts, padding=True)
# Pad to specific length
inputs = tokenizer(texts, padding="max_length", max_length=128)
# No padding
inputs = tokenizer(texts, padding=False)
```
**pad_to_multiple_of**: Pad to multiple of specified value
```python
inputs = tokenizer(texts, padding=True, pad_to_multiple_of=8)
```
### Truncation
**truncation**: Limit sequence length
```python
# Truncate to max_length
inputs = tokenizer(text, truncation=True, max_length=512)
# Truncate first sequence in pairs
inputs = tokenizer(text1, text2, truncation="only_first")
# Truncate second sequence
inputs = tokenizer(text1, text2, truncation="only_second")
# Truncate longest first (default for pairs)
inputs = tokenizer(text1, text2, truncation="longest_first", max_length=512)
```
### Max Length
**max_length**: Maximum sequence length
```python
inputs = tokenizer(text, max_length=512, truncation=True)
```
### Additional Outputs
**return_attention_mask**: Include attention mask (default True)
```python
inputs = tokenizer(text, return_attention_mask=True)
```
**return_token_type_ids**: Segment IDs for sentence pairs
```python
inputs = tokenizer(text1, text2, return_token_type_ids=True)
```
**return_offsets_mapping**: Character position mapping (Fast tokenizers only)
```python
inputs = tokenizer(text, return_offsets_mapping=True)
```
**return_length**: Include sequence lengths
```python
inputs = tokenizer(texts, padding=True, return_length=True)
```
## Special Tokens
### Predefined Special Tokens
Access special tokens:
```python
print(tokenizer.cls_token) # [CLS] or <s>
print(tokenizer.sep_token) # [SEP] or </s>
print(tokenizer.pad_token) # [PAD]
print(tokenizer.unk_token) # [UNK]
print(tokenizer.mask_token) # [MASK]
print(tokenizer.eos_token) # End of sequence
print(tokenizer.bos_token) # Beginning of sequence
# Get IDs
print(tokenizer.cls_token_id)
print(tokenizer.sep_token_id)
```
### Add Special Tokens
Manual control:
```python
# Automatically add special tokens (default True)
inputs = tokenizer(text, add_special_tokens=True)
# Skip special tokens
inputs = tokenizer(text, add_special_tokens=False)
```
### Custom Special Tokens
```python
special_tokens_dict = {
"additional_special_tokens": ["<CUSTOM>", "<SPECIAL>"]
}
num_added = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added} tokens")
# Resize model embeddings after adding tokens
model.resize_token_embeddings(len(tokenizer))
```
## Sentence Pairs
Tokenize text pairs:
```python
text1 = "What is the capital of France?"
text2 = "Paris is the capital of France."
# Automatically handles separation
inputs = tokenizer(text1, text2, padding=True, truncation=True)
# Results in: [CLS] text1 [SEP] text2 [SEP]
```
## Batch Encoding
Process multiple texts:
```python
texts = ["First text", "Second text", "Third text"]
# Basic batch encoding
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Access individual encodings
for i in range(len(texts)):
input_ids = batch["input_ids"][i]
attention_mask = batch["attention_mask"][i]
```
## Fast Tokenizers
Use Rust-based tokenizers for speed:
```python
from transformers import AutoTokenizer
# Automatically loads Fast version if available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Check if Fast
print(tokenizer.is_fast) # True
# Force Fast tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
# Force slow (Python) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
```
### Fast Tokenizer Features
**Offset mapping** (character positions):
```python
inputs = tokenizer("Hello world", return_offsets_mapping=True)
print(inputs["offset_mapping"])
# [(0, 0), (0, 5), (6, 11), (0, 0)] # [CLS], "Hello", "world", [SEP]
```
**Token to word mapping**:
```python
encoding = tokenizer("Hello world")
word_ids = encoding.word_ids()
print(word_ids) # [None, 0, 1, None] # [CLS]=None, "Hello"=0, "world"=1, [SEP]=None
```
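The offset mapping makes it straightforward to map token-level results back to character spans in the original string. A minimal sketch:
```python
text = "Hugging Face is based in New York"
encoding = tokenizer(text, return_offsets_mapping=True)

# Recover the exact substring covered by the token at index 2
start, end = encoding["offset_mapping"][2]
print(text[start:end])
```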
## Saving Tokenizers
Save locally:
```python
tokenizer.save_pretrained("./my_tokenizer")
```
Push to Hub:
```python
tokenizer.push_to_hub("username/my-tokenizer")
```
## Advanced Usage
### Vocabulary
Access vocabulary:
```python
vocab = tokenizer.get_vocab()
vocab_size = len(vocab)
# Get token for ID
token = tokenizer.convert_ids_to_tokens(100)
# Get ID for token
token_id = tokenizer.convert_tokens_to_ids("hello")
```
### Encoding Details
Get detailed encoding information:
```python
encoding = tokenizer("Hello world", return_tensors="pt")
# Original methods still available
tokens = encoding.tokens()
word_ids = encoding.word_ids()
sequence_ids = encoding.sequence_ids()
```
### Custom Preprocessing
Subclass a concrete tokenizer class for custom behavior (`AutoTokenizer` is only a factory, so it cannot be subclassed directly):
```python
from transformers import BertTokenizerFast

class CustomTokenizer(BertTokenizerFast):
    def __call__(self, text, **kwargs):
        # Custom preprocessing before normal tokenization
        text = text.lower().strip()
        return super().__call__(text, **kwargs)

tokenizer = CustomTokenizer.from_pretrained("bert-base-uncased")
```
## Chat Templates
For conversational models:
```python
messages = [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi there!"},
{"role": "user", "content": "How are you?"}
]
# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
# Tokenize directly
inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt")
```
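For generation, append the assistant turn header with `add_generation_prompt=True` and pass the result to `generate`. A minimal sketch, assuming a chat-tuned causal LM (the checkpoint name is a placeholder):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "chat-model-id"  # Placeholder for a chat-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # Append the assistant header so the model continues as assistant
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```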
## Common Patterns
### Pattern 1: Simple Text Classification
```python
texts = ["I love this!", "I hate this!"]
labels = [1, 0]
inputs = tokenizer(
texts,
padding=True,
truncation=True,
max_length=512,
return_tensors="pt"
)
# Use with model
outputs = model(**inputs, labels=torch.tensor(labels))
```
### Pattern 2: Question Answering
```python
question = "What is the capital?"
context = "Paris is the capital of France."
inputs = tokenizer(
question,
context,
padding=True,
truncation=True,
max_length=384,
return_tensors="pt"
)
```
### Pattern 3: Text Generation
```python
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate
outputs = model.generate(
inputs["input_ids"],
max_new_tokens=50,
pad_token_id=tokenizer.eos_token_id
)
# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
### Pattern 4: Dataset Tokenization
```python
def tokenize_function(examples):
return tokenizer(
examples["text"],
padding="max_length",
truncation=True,
max_length=512
)
# Apply to dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
```
## Best Practices
1. **Always specify return_tensors**: For model input
2. **Use padding and truncation**: For batch processing
3. **Set max_length explicitly**: Prevent memory issues
4. **Use Fast tokenizers**: When available for speed
5. **Handle pad_token**: Set to eos_token if None for generation
6. **Add special tokens**: Leave enabled (default) unless specific reason
7. **Resize embeddings**: After adding custom tokens
8. **Decode with skip_special_tokens**: For cleaner output
9. **Use batched processing**: For efficiency with datasets
10. **Save tokenizer with model**: Ensure compatibility (see the example below)
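For point 10, saving both artifacts to the same directory keeps vocabulary and weights in sync. A minimal sketch:
```python
save_dir = "./my_model"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Reload later from the same directory
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
```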
## Common Issues
**Padding token not set:**
```python
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
```
**Sequence too long:**
```python
# Enable truncation
inputs = tokenizer(text, truncation=True, max_length=512)
```
**Mismatched vocabulary:**
```python
# Always load tokenizer and model from same checkpoint
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModel.from_pretrained("model-id")
```
**Attention mask issues:**
```python
# Ensure attention_mask is passed
outputs = model(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"]
)
```


@@ -1,182 +1,50 @@
# Training with Transformers # Training and Fine-Tuning
Transformers provides comprehensive training capabilities through the `Trainer` API, supporting distributed training, mixed precision, and advanced optimization techniques. ## Overview
## Basic Training Workflow Fine-tune pre-trained models on custom datasets using the Trainer API. The Trainer handles training loops, gradient accumulation, mixed precision, logging, and checkpointing.
## Basic Fine-Tuning Workflow
### Step 1: Load and Preprocess Data
```python ```python
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments
)
from datasets import load_dataset from datasets import load_dataset
# 1. Load and preprocess data # Load dataset
dataset = load_dataset("imdb") dataset = load_dataset("yelp_review_full")
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
# Tokenize
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples): def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True) return tokenizer(
examples["text"],
padding="max_length",
truncation=True,
max_length=512
)
tokenized_datasets = dataset.map(tokenize_function, batched=True) train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)
```
### Step 2: Load Model
```python
from transformers import AutoModelForSequenceClassification
# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained( model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased", "bert-base-uncased",
num_labels=2 num_labels=5 # Number of classes
)
# 3. Define training arguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
learning_rate=2e-5,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
# 4. Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
)
# 5. Train
trainer.train()
# 6. Evaluate
trainer.evaluate()
# 7. Save model
trainer.save_model("./final_model")
```
## TrainingArguments Configuration
### Essential Parameters
**Output and Logging:**
- `output_dir`: Directory for checkpoints and outputs (required)
- `logging_dir`: TensorBoard log directory (default: `{output_dir}/runs`)
- `logging_steps`: Log every N steps (default: 500)
- `logging_strategy`: "steps" or "epoch"
**Training Duration:**
- `num_train_epochs`: Number of epochs (default: 3.0)
- `max_steps`: Max training steps (overrides num_train_epochs if set)
**Batch Size and Gradient Accumulation:**
- `per_device_train_batch_size`: Batch size per device (default: 8)
- `per_device_eval_batch_size`: Eval batch size per device (default: 8)
- `gradient_accumulation_steps`: Accumulate gradients over N steps (default: 1)
- Effective batch size = `per_device_train_batch_size * gradient_accumulation_steps * num_gpus`
**Learning Rate:**
- `learning_rate`: Peak learning rate (default: 5e-5)
- `lr_scheduler_type`: Scheduler type ("linear", "cosine", "constant", etc.)
- `warmup_steps`: Warmup steps (default: 0)
- `warmup_ratio`: Warmup as fraction of total steps
**Evaluation:**
- `eval_strategy`: "no", "steps", or "epoch" (default: "no")
- `eval_steps`: Evaluate every N steps (if eval_strategy="steps")
- `eval_delay`: Delay evaluation until N steps
**Checkpointing:**
- `save_strategy`: "no", "steps", or "epoch" (default: "steps")
- `save_steps`: Save checkpoint every N steps (default: 500)
- `save_total_limit`: Keep only N most recent checkpoints
- `load_best_model_at_end`: Load best checkpoint at end (default: False)
- `metric_for_best_model`: Metric to determine best model
**Optimization:**
- `optim`: Optimizer ("adamw_torch", "adamw_hf", "sgd", etc.)
- `weight_decay`: Weight decay coefficient (default: 0.0)
- `adam_beta1`, `adam_beta2`: Adam optimizer betas
- `adam_epsilon`: Epsilon for Adam (default: 1e-8)
- `max_grad_norm`: Max gradient norm for clipping (default: 1.0)
### Mixed Precision Training
```python
training_args = TrainingArguments(
output_dir="./results",
fp16=True, # Use fp16 on NVIDIA GPUs
fp16_opt_level="O1", # O0, O1, O2, O3 (Apex levels)
# or
bf16=True, # Use bf16 on Ampere+ GPUs (better than fp16)
) )
``` ```
### Distributed Training ### Step 3: Define Metrics
**DataParallel (single-node multi-GPU):**
```python
# Automatic with multiple GPUs
training_args = TrainingArguments(
output_dir="./results",
per_device_train_batch_size=16, # Per GPU
)
# Run: python script.py
```
**DistributedDataParallel (multi-node or multi-GPU):**
```bash
# Single node, multiple GPUs
python -m torch.distributed.launch --nproc_per_node=4 script.py
# Or use accelerate
accelerate config
accelerate launch script.py
```
**DeepSpeed Integration:**
```python
training_args = TrainingArguments(
output_dir="./results",
deepspeed="ds_config.json", # DeepSpeed config file
)
```
### Advanced Features
**Gradient Checkpointing (reduce memory):**
```python
training_args = TrainingArguments(
output_dir="./results",
gradient_checkpointing=True,
)
```
**Compilation with torch.compile:**
```python
training_args = TrainingArguments(
output_dir="./results",
torch_compile=True,
torch_compile_backend="inductor", # or "cudagraphs"
)
```
**Push to Hub:**
```python
training_args = TrainingArguments(
output_dir="./results",
push_to_hub=True,
hub_model_id="username/model-name",
hub_strategy="every_save", # or "end"
)
```
## Custom Training Components
### Custom Metrics
```python ```python
import evaluate import evaluate
@@ -188,32 +56,195 @@ def compute_metrics(eval_pred):
logits, labels = eval_pred logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1) predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels) return metric.compute(predictions=predictions, references=labels)
```
### Step 4: Configure Training
```python
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
save_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
num_train_epochs=3,
weight_decay=0.01,
logging_dir="./logs",
logging_steps=10,
load_best_model_at_end=True,
metric_for_best_model="accuracy",
)
```
### Step 5: Create Trainer and Train
```python
from transformers import Trainer
trainer = Trainer( trainer = Trainer(
model=model, model=model,
args=training_args, args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics, compute_metrics=compute_metrics,
) )
# Start training
trainer.train()
# Evaluate
results = trainer.evaluate()
print(results)
``` ```
### Custom Loss Function ### Step 6: Save Model
```python ```python
class CustomTrainer(Trainer): trainer.save_model("./fine_tuned_model")
def compute_loss(self, model, inputs, return_outputs=False): tokenizer.save_pretrained("./fine_tuned_model")
labels = inputs.pop("labels")
outputs = model(**inputs)
logits = outputs.logits
# Custom loss calculation # Or push to Hub
loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights) trainer.push_to_hub("username/my-finetuned-model")
loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
return (loss, outputs) if return_outputs else loss
``` ```
### Data Collator ## TrainingArguments Parameters
### Essential Parameters
**output_dir**: Directory for checkpoints and logs
```python
output_dir="./results"
```
**num_train_epochs**: Number of training epochs
```python
num_train_epochs=3
```
**per_device_train_batch_size**: Batch size per GPU/CPU
```python
per_device_train_batch_size=8
```
**learning_rate**: Optimizer learning rate
```python
learning_rate=2e-5 # Common for BERT-style models
learning_rate=5e-5 # Common for smaller models
```
**weight_decay**: L2 regularization
```python
weight_decay=0.01
```
### Evaluation and Saving
**eval_strategy**: When to evaluate ("no", "steps", "epoch")
```python
eval_strategy="epoch" # Evaluate after each epoch
eval_strategy="steps" # Evaluate every eval_steps
```
**save_strategy**: When to save checkpoints
```python
save_strategy="epoch"
save_strategy="steps"
save_steps=500
```
**load_best_model_at_end**: Load best checkpoint after training
```python
load_best_model_at_end=True
metric_for_best_model="accuracy" # Metric to compare
```
### Optimization
**gradient_accumulation_steps**: Accumulate gradients over multiple steps
```python
gradient_accumulation_steps=4 # Effective batch size = batch_size * 4
```
**fp16**: Enable mixed precision (NVIDIA GPUs)
```python
fp16=True
```
**bf16**: Enable bfloat16 (newer GPUs)
```python
bf16=True
```
**gradient_checkpointing**: Trade compute for memory
```python
gradient_checkpointing=True # Slower but uses less memory
```
**optim**: Optimizer choice
```python
optim="adamw_torch" # Default
optim="adamw_8bit" # 8-bit Adam (requires bitsandbytes)
optim="adafactor" # Memory-efficient alternative
```
### Learning Rate Scheduling
**lr_scheduler_type**: Learning rate schedule
```python
lr_scheduler_type="linear" # Linear decay
lr_scheduler_type="cosine" # Cosine annealing
lr_scheduler_type="constant" # No decay
lr_scheduler_type="constant_with_warmup"
```
**warmup_steps** or **warmup_ratio**: Warmup period
```python
warmup_steps=500
# Or
warmup_ratio=0.1 # 10% of total steps
```
### Logging
**logging_dir**: TensorBoard logs directory
```python
logging_dir="./logs"
```
**logging_steps**: Log every N steps
```python
logging_steps=10
```
**report_to**: Logging integrations
```python
report_to=["tensorboard"]
report_to=["wandb"]
report_to=["tensorboard", "wandb"]
```
### Distributed Training
**ddp_backend**: Distributed backend
```python
ddp_backend="nccl" # For multi-GPU
```
**deepspeed**: DeepSpeed config file
```python
deepspeed="ds_config.json"
```
## Data Collators
Handle dynamic padding and special preprocessing:
### DataCollatorWithPadding
Pad sequences to longest in batch:
```python ```python
from transformers import DataCollatorWithPadding from transformers import DataCollatorWithPadding
@@ -227,102 +258,243 @@ trainer = Trainer(
) )
``` ```
### Callbacks ### DataCollatorForLanguageModeling
For masked language modeling:
```python
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=True,
mlm_probability=0.15
)
```
### DataCollatorForSeq2Seq
For sequence-to-sequence tasks:
```python
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(
tokenizer=tokenizer,
model=model,
padding=True
)
```
## Custom Training
### Custom Trainer
Override methods for custom behavior:
```python
import torch
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Weighted cross-entropy; class_weights is a user-defined tensor of per-class weights
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```
### Custom Callbacks
Monitor and control training:
```python ```python
from transformers import TrainerCallback from transformers import TrainerCallback
class CustomCallback(TrainerCallback): class CustomCallback(TrainerCallback):
def on_epoch_end(self, args, state, control, **kwargs): def on_epoch_end(self, args, state, control, **kwargs):
print(f"Epoch {state.epoch} completed!") print(f"Epoch {state.epoch} completed")
# Custom logic here
return control return control
trainer = Trainer( trainer = Trainer(
model=model, model=model,
args=training_args, args=training_args,
train_dataset=train_dataset,
callbacks=[CustomCallback], callbacks=[CustomCallback],
) )
``` ```
## Hyperparameter Search ## Advanced Training Techniques
### Parameter-Efficient Fine-Tuning (PEFT)
Use LoRA for efficient fine-tuning:
```python
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["query", "value"],
lora_dropout=0.05,
bias="none",
task_type="SEQ_CLS"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # Shows reduced parameter count
# Train normally with Trainer
trainer = Trainer(model=model, args=training_args, ...)
trainer.train()
```
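After training, `save_pretrained` on the PEFT-wrapped model writes only the small adapter weights, which can later be attached to a freshly loaded base model. A minimal sketch (the adapter directory is a placeholder):
```python
# Save only the LoRA adapter weights (a few MB instead of the full model)
model.save_pretrained("./lora_adapter")

# Reload: load the base model, then attach the adapter
from peft import PeftModel
from transformers import AutoModelForSequenceClassification

base_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)
model = PeftModel.from_pretrained(base_model, "./lora_adapter")
```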
### Gradient Checkpointing
Reduce memory at cost of speed:
```python
model.gradient_checkpointing_enable()
training_args = TrainingArguments(
gradient_checkpointing=True,
...
)
```
### Mixed Precision Training
```python
training_args = TrainingArguments(
fp16=True, # For NVIDIA GPUs with Tensor Cores
# or
bf16=True, # For newer GPUs (A100, H100)
...
)
```
### DeepSpeed Integration
For very large models, create a DeepSpeed configuration file (for example `ds_config.json`):
```json
{
"train_batch_size": 16,
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "AdamW",
"params": {
"lr": 2e-5
}
},
"fp16": {
"enabled": true
},
"zero_optimization": {
"stage": 2
}
}
```
```python
training_args = TrainingArguments(
deepspeed="ds_config.json",
...
)
```
## Training Tips
### Hyperparameter Tuning
Common starting points:
- **Learning rate**: 2e-5 to 5e-5 for BERT-like models, 1e-4 to 1e-3 for smaller models
- **Batch size**: 8-32 depending on GPU memory
- **Epochs**: 2-4 for fine-tuning, more for domain adaptation
- **Warmup**: 10% of total steps
Use Optuna for hyperparameter search:
```python ```python
def model_init(): def model_init():
return AutoModelForSequenceClassification.from_pretrained( return AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased", "bert-base-uncased",
num_labels=2 num_labels=5
) )
trainer = Trainer( def optuna_hp_space(trial):
model_init=model_init, return {
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
)
# Optuna-based search
best_trial = trainer.hyperparameter_search(
direction="maximize",
backend="optuna",
n_trials=10,
hp_space=lambda trial: {
"learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True), "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
"per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]), "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
"num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5), "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
} }
trainer = Trainer(model_init=model_init, args=training_args, ...)
best_trial = trainer.hyperparameter_search(
direction="maximize",
backend="optuna",
hp_space=optuna_hp_space,
n_trials=10,
) )
``` ```
## Training Best Practices ### Monitoring Training
1. **Start with small learning rates**: 2e-5 to 5e-5 for fine-tuning Use TensorBoard:
2. **Use warmup**: 5-10% of total steps for learning rate warmup ```bash
3. **Monitor training**: Use eval_strategy="epoch" or "steps" to track progress tensorboard --logdir ./logs
4. **Save checkpoints**: Set save_strategy and save_total_limit ```
5. **Use mixed precision**: Enable fp16 or bf16 for faster training
6. **Gradient accumulation**: For large effective batch sizes on limited memory
7. **Load best model**: Set load_best_model_at_end=True to avoid overfitting
8. **Push to Hub**: Enable push_to_hub for easy model sharing and versioning
## Common Training Patterns Or Weights & Biases:
### Classification
```python ```python
model = AutoModelForSequenceClassification.from_pretrained( import wandb
"bert-base-uncased", wandb.init(project="my-project")
num_labels=num_classes,
id2label=id2label, training_args = TrainingArguments(
label2id=label2id report_to=["wandb"],
...
) )
``` ```
### Question Answering ### Resume Training
Resume from checkpoint:
```python ```python
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased") trainer.train(resume_from_checkpoint="./results/checkpoint-1000")
``` ```
### Token Classification (NER) ## Common Issues
```python
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=num_tags,
id2label=id2label,
label2id=label2id
)
```
### Sequence-to-Sequence **CUDA out of memory:**
```python - Reduce batch size
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base") - Enable gradient checkpointing
``` - Use gradient accumulation
- Use 8-bit optimizers
### Causal Language Modeling **Overfitting:**
```python - Increase weight_decay
model = AutoModelForCausalLM.from_pretrained("gpt2") - Add dropout
``` - Use early stopping
- Reduce model size or training epochs
### Masked Language Modeling **Slow training:**
```python - Increase batch size
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased") - Enable mixed precision (fp16/bf16)
``` - Use multiple GPUs
- Optimize data loading
## Best Practices
1. **Start small**: Test on small dataset subset first
2. **Use evaluation**: Monitor validation metrics
3. **Save checkpoints**: Enable save_strategy
4. **Log extensively**: Use TensorBoard or W&B
5. **Try different learning rates**: Start with 2e-5
6. **Use warmup**: Helps training stability
7. **Enable mixed precision**: Faster training
8. **Consider PEFT**: For large models with limited resources


@@ -1,241 +0,0 @@
#!/usr/bin/env python3
"""
Fine-tune a transformer model for text classification.
This script demonstrates the complete workflow for fine-tuning a pre-trained
model on a classification task using the Trainer API.
"""
import numpy as np
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments,
Trainer,
DataCollatorWithPadding,
)
import evaluate
def load_and_prepare_data(dataset_name="imdb", model_name="distilbert-base-uncased", max_samples=None):
"""
Load dataset and tokenize.
Args:
dataset_name: Name of the dataset to load
model_name: Name of the model/tokenizer to use
max_samples: Limit number of samples (for quick testing)
Returns:
tokenized_datasets, tokenizer
"""
print(f"Loading dataset: {dataset_name}")
dataset = load_dataset(dataset_name)
# Optionally limit samples for quick testing
if max_samples:
dataset["train"] = dataset["train"].select(range(max_samples))
dataset["test"] = dataset["test"].select(range(min(max_samples, len(dataset["test"]))))
print(f"Loading tokenizer: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize_function(examples):
return tokenizer(
examples["text"],
padding="max_length",
truncation=True,
max_length=512
)
print("Tokenizing dataset...")
tokenized_datasets = dataset.map(tokenize_function, batched=True)
return tokenized_datasets, tokenizer
def create_model(model_name, num_labels, id2label, label2id):
"""
Create classification model.
Args:
model_name: Name of the pre-trained model
num_labels: Number of classification labels
id2label: Dictionary mapping label IDs to names
label2id: Dictionary mapping label names to IDs
Returns:
model
"""
print(f"Loading model: {model_name}")
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=num_labels,
id2label=id2label,
label2id=label2id
)
return model
def define_compute_metrics(metric_name="accuracy"):
"""
Define function to compute metrics during evaluation.
Args:
metric_name: Name of the metric to use
Returns:
compute_metrics function
"""
metric = evaluate.load(metric_name)
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
return compute_metrics
def train_model(model, tokenizer, train_dataset, eval_dataset, output_dir="./results"):
"""
Train the model.
Args:
model: The model to train
tokenizer: The tokenizer
train_dataset: Training dataset
eval_dataset: Evaluation dataset
output_dir: Directory for checkpoints and logs
Returns:
trained model, trainer
"""
# Define training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
learning_rate=2e-5,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
logging_dir=f"{output_dir}/logs",
logging_steps=100,
save_total_limit=2,
fp16=False, # Set to True if using GPU with fp16 support
)
# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=data_collator,
compute_metrics=define_compute_metrics("accuracy"),
)
# Train
print("\nStarting training...")
trainer.train()
# Evaluate
print("\nEvaluating model...")
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")
return model, trainer
def test_inference(model, tokenizer, id2label):
"""
Test the trained model with sample texts.
Args:
model: Trained model
tokenizer: Tokenizer
id2label: Dictionary mapping label IDs to names
"""
print("\n=== Testing Inference ===")
test_texts = [
"This movie was absolutely fantastic! I loved every minute of it.",
"Terrible film. Waste of time and money.",
"It was okay, nothing special but not bad either."
]
for text in test_texts:
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
outputs = model(**inputs)
predictions = outputs.logits.argmax(-1)
predicted_label = id2label[predictions.item()]
confidence = outputs.logits.softmax(-1).max().item()
print(f"\nText: {text}")
print(f"Prediction: {predicted_label} (confidence: {confidence:.3f})")
def main():
"""Main training pipeline."""
# Configuration
DATASET_NAME = "imdb"
MODEL_NAME = "distilbert-base-uncased"
OUTPUT_DIR = "./results"
MAX_SAMPLES = None # Set to a small number (e.g., 1000) for quick testing
# Label mapping
id2label = {0: "negative", 1: "positive"}
label2id = {"negative": 0, "positive": 1}
num_labels = len(id2label)
print("=" * 60)
print("Fine-Tuning Text Classification Model")
print("=" * 60)
# Load and prepare data
tokenized_datasets, tokenizer = load_and_prepare_data(
dataset_name=DATASET_NAME,
model_name=MODEL_NAME,
max_samples=MAX_SAMPLES
)
# Create model
model = create_model(
model_name=MODEL_NAME,
num_labels=num_labels,
id2label=id2label,
label2id=label2id
)
# Train model
model, trainer = train_model(
model=model,
tokenizer=tokenizer,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
output_dir=OUTPUT_DIR
)
# Save final model
print(f"\nSaving model to {OUTPUT_DIR}/final_model")
trainer.save_model(f"{OUTPUT_DIR}/final_model")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/final_model")
# Test inference
test_inference(model, tokenizer, id2label)
print("\n" + "=" * 60)
print("Training completed successfully!")
print("=" * 60)
if __name__ == "__main__":
main()
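
After training finishes, the saved checkpoint can be reused without re-running the Trainer. A minimal sketch, assuming the default `./results/final_model` path from the script above was not changed:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint saved by trainer.save_model() / tokenizer.save_pretrained()
classifier = pipeline(
    "text-classification",
    model="./results/final_model",
    tokenizer="./results/final_model",
)
print(classifier("A surprisingly heartfelt and well-acted film."))
```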

View File

@@ -1,189 +0,0 @@
#!/usr/bin/env python3
"""
Text generation with different decoding strategies.
This script demonstrates various text generation approaches using
different sampling and decoding strategies.
"""
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
def load_model_and_tokenizer(model_name="gpt2"):
"""
Load model and tokenizer.
Args:
model_name: Name of the model to load
Returns:
model, tokenizer
"""
print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Set pad token if not already set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
return model, tokenizer
def generate_with_greedy(model, tokenizer, prompt, max_new_tokens=50):
"""Greedy decoding - always picks highest probability token."""
print("\n=== Greedy Decoding ===")
print(f"Prompt: {prompt}")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
num_beams=1,
pad_token_id=tokenizer.pad_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {generated_text}\n")
def generate_with_beam_search(model, tokenizer, prompt, max_new_tokens=50, num_beams=5):
"""Beam search - explores multiple hypotheses."""
print("\n=== Beam Search ===")
print(f"Prompt: {prompt}")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
num_beams=num_beams,
early_stopping=True,
no_repeat_ngram_size=2,
pad_token_id=tokenizer.pad_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {generated_text}\n")
def generate_with_sampling(model, tokenizer, prompt, max_new_tokens=50,
temperature=0.7, top_k=50, top_p=0.9):
"""Sampling with temperature, top-k, and nucleus (top-p) sampling."""
print("\n=== Sampling (Temperature + Top-K + Top-P) ===")
print(f"Prompt: {prompt}")
print(f"Parameters: temperature={temperature}, top_k={top_k}, top_p={top_p}")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=temperature,
top_k=top_k,
top_p=top_p,
pad_token_id=tokenizer.pad_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {generated_text}\n")
def generate_multiple_sequences(model, tokenizer, prompt, max_new_tokens=50,
num_return_sequences=3):
"""Generate multiple diverse sequences."""
print("\n=== Multiple Sequences (with Sampling) ===")
print(f"Prompt: {prompt}")
print(f"Generating {num_return_sequences} sequences...")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=0.8,
top_p=0.95,
num_return_sequences=num_return_sequences,
pad_token_id=tokenizer.pad_token_id
)
for i, output in enumerate(outputs):
generated_text = tokenizer.decode(output, skip_special_tokens=True)
print(f"\nSequence {i+1}: {generated_text}")
print()
def generate_with_config(model, tokenizer, prompt):
"""Use GenerationConfig for reusable configuration."""
print("\n=== Using GenerationConfig ===")
print(f"Prompt: {prompt}")
# Create a generation config
generation_config = GenerationConfig(
max_new_tokens=50,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
pad_token_id=tokenizer.pad_token_id
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {generated_text}\n")
def compare_temperatures(model, tokenizer, prompt, max_new_tokens=50):
"""Compare different temperature settings."""
print("\n=== Temperature Comparison ===")
print(f"Prompt: {prompt}\n")
temperatures = [0.3, 0.7, 1.0, 1.5]
for temp in temperatures:
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=temp,
top_p=0.9,
pad_token_id=tokenizer.pad_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Temperature {temp}: {generated_text}\n")
def main():
"""Run all generation examples."""
print("=" * 70)
print("Text Generation Examples")
print("=" * 70)
# Load model and tokenizer
model, tokenizer = load_model_and_tokenizer("gpt2")
# Example prompts
story_prompt = "Once upon a time in a distant galaxy"
factual_prompt = "The three branches of the US government are"
# Demonstrate different strategies
generate_with_greedy(model, tokenizer, story_prompt)
generate_with_beam_search(model, tokenizer, factual_prompt)
generate_with_sampling(model, tokenizer, story_prompt)
generate_multiple_sequences(model, tokenizer, story_prompt, num_return_sequences=3)
generate_with_config(model, tokenizer, story_prompt)
compare_temperatures(model, tokenizer, story_prompt)
print("=" * 70)
print("All generation examples completed!")
print("=" * 70)
if __name__ == "__main__":
main()
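
The functions above generate from one prompt at a time. When batching several prompts with a decoder-only model like GPT-2, left padding is generally required so that generation continues from the real tokens rather than from padding. A minimal sketch, reusing the gpt2 checkpoint assumed above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Left padding keeps the prompt tokens adjacent to the generated continuation
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["Once upon a time", "The secret to great cooking is"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.pad_token_id,
)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```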

View File

@@ -1,133 +0,0 @@
#!/usr/bin/env python3
"""
Quick inference using Transformers pipelines.
This script demonstrates how to quickly use pre-trained models for inference
across various tasks using the pipeline API.
"""
from transformers import pipeline
def text_classification_example():
"""Sentiment analysis example."""
print("=== Text Classification ===")
classifier = pipeline("text-classification")
result = classifier("I love using Transformers! It makes NLP so easy.")
print(f"Result: {result}\n")
def named_entity_recognition_example():
"""Named Entity Recognition example."""
print("=== Named Entity Recognition ===")
ner = pipeline("token-classification", aggregation_strategy="simple")
text = "My name is Sarah and I work at Microsoft in Seattle"
entities = ner(text)
for entity in entities:
print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.3f})")
print()
def question_answering_example():
"""Question Answering example."""
print("=== Question Answering ===")
qa = pipeline("question-answering")
context = "Paris is the capital and most populous city of France. It is located in northern France."
question = "What is the capital of France?"
answer = qa(question=question, context=context)
print(f"Question: {question}")
print(f"Answer: {answer['answer']} (score: {answer['score']:.3f})\n")
def text_generation_example():
"""Text generation example."""
print("=== Text Generation ===")
generator = pipeline("text-generation", model="gpt2")
prompt = "Once upon a time in a land far away"
generated = generator(prompt, max_length=50, num_return_sequences=1)
print(f"Prompt: {prompt}")
print(f"Generated: {generated[0]['generated_text']}\n")
def summarization_example():
"""Text summarization example."""
print("=== Summarization ===")
summarizer = pipeline("summarization")
article = """
The Transformers library provides thousands of pretrained models to perform tasks
on texts such as classification, information extraction, question answering,
summarization, translation, text generation, etc in over 100 languages. Its aim
is to make cutting-edge NLP easier to use for everyone. The library provides APIs
to quickly download and use pretrained models on a given text, fine-tune them on
your own datasets then share them with the community on the model hub.
"""
summary = summarizer(article, max_length=50, min_length=25, do_sample=False)
print(f"Summary: {summary[0]['summary_text']}\n")
def translation_example():
"""Translation example."""
print("=== Translation ===")
translator = pipeline("translation_en_to_fr")
text = "Hello, how are you today?"
translation = translator(text)
print(f"English: {text}")
print(f"French: {translation[0]['translation_text']}\n")
def zero_shot_classification_example():
"""Zero-shot classification example."""
print("=== Zero-Shot Classification ===")
classifier = pipeline("zero-shot-classification")
text = "This is a breaking news story about a major earthquake."
candidate_labels = ["politics", "sports", "science", "breaking news"]
result = classifier(text, candidate_labels)
print(f"Text: {text}")
print("Predictions:")
for label, score in zip(result['labels'], result['scores']):
print(f" {label}: {score:.3f}")
print()
def image_classification_example():
"""Image classification example (requires PIL)."""
print("=== Image Classification ===")
try:
from PIL import Image
import requests
classifier = pipeline("image-classification")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
predictions = classifier(image)
print("Top predictions:")
for pred in predictions[:3]:
print(f" {pred['label']}: {pred['score']:.3f}")
print()
    except ImportError:
        print("PIL or requests not installed. Skipping image classification example.\n")
def main():
"""Run all examples."""
print("Transformers Quick Inference Examples")
print("=" * 50 + "\n")
# Text tasks
text_classification_example()
named_entity_recognition_example()
question_answering_example()
text_generation_example()
summarization_example()
translation_example()
zero_shot_classification_example()
# Vision task (optional)
image_classification_example()
print("=" * 50)
print("All examples completed!")
if __name__ == "__main__":
main()
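
The pipelines above fall back to each task's default checkpoint and run on CPU. For reproducible results and faster inference it is usually better to pin an explicit model and device. A short sketch; the checkpoint named below is the widely used SST-2 DistilBERT sentiment model, but any compatible Hub model works:

```python
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1  # GPU index, or -1 for CPU
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=device,
)
print(classifier(["Great service!", "Never coming back."]))
```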