Update Hugging Face Transformers

This commit is contained in:
Timothy Kassis
2025-10-21 10:30:38 -07:00
parent 1a9149b089
commit 11da596765
12 changed files with 2328 additions and 3148 deletions

File diff suppressed because it is too large


@@ -1,22 +1,8 @@
# Text Generation Strategies
Comprehensive guide to text generation methods in Transformers for controlling output quality, creativity, and diversity.
Transformers provides flexible text generation capabilities through the `generate()` method, supporting multiple decoding strategies and configuration options.
## Overview
Text generation is the process of predicting tokens sequentially using a language model. The choice of generation strategy significantly impacts output quality, diversity, and computational cost.
**When to use each strategy:**
- **Greedy**: Fast, deterministic, good for short outputs or when consistency is critical
- **Beam Search**: Better quality for tasks with clear "correct" answers (translation, summarization)
- **Sampling**: Creative, diverse outputs for open-ended generation (stories, dialogue)
- **Top-k/Top-p**: Balanced creativity and coherence
## Basic Generation Methods
### Greedy Decoding
Selects the highest-probability token at each step. Fast but prone to repetition and suboptimal sequences.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Greedy decoding (default)
inputs = tokenizer("The future of AI", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
generated_text = tokenizer.decode(outputs[0])
print(generated_text)
```
**Characteristics:**
- Deterministic (always same output for same input)
- Fast (single forward pass per token)
- Prone to repetition in longer sequences
- Best for: Short generations, deterministic applications
**Parameters:**
```python
outputs = model.generate(
**inputs,
max_new_tokens=50, # Number of tokens to generate
min_length=10, # Minimum total length
pad_token_id=tokenizer.pad_token_id,
)
```
### Beam Search
Maintains multiple hypotheses (beams) simultaneously and selects the sequence with the highest overall probability.
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,              # Number of beams
    early_stopping=True,      # Stop when all beams finish
    no_repeat_ngram_size=2,   # Prevent 2-gram repetition
)
```
**Characteristics:**
- Higher quality than greedy for tasks with "correct" answers
- Slower than greedy (num_beams forward passes per step)
- Still can suffer from repetition
- Best for: Translation, summarization, QA generation
**Key parameters:**
- `num_beams`: Number of beams (higher = more thorough but slower)
- `early_stopping`: Stop when all beams finish (True/False)
- `length_penalty`: Exponential penalty for length (>1.0 favors longer sequences)
- `no_repeat_ngram_size`: Prevent repeating n-grams
### Beam Search with Sampling
Combines beam search with sampling for diverse but coherent outputs.
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,           # Sample within each beam
    num_beams=5,
    length_penalty=1.0,       # >1: longer sequences, <1: shorter
    early_stopping=True,      # Stop when num_beams sequences finish
    no_repeat_ngram_size=2,   # Block repeating n-grams
    num_return_sequences=1,   # Return up to num_beams sequences
)
```
**Length Penalty:**
- `length_penalty > 1.0`: Favor longer sequences
- `length_penalty = 1.0`: No penalty
- `length_penalty < 1.0`: Favor shorter sequences
### Sampling (Multinomial)
Randomly sample tokens according to the probability distribution.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True, # Enable sampling
temperature=1.0, # Sampling temperature
num_beams=1, # Default (single beam for plain multinomial sampling)
)
```
**Characteristics:**
- Non-deterministic (different output each time)
- More diverse and creative than greedy/beam search
- Can produce incoherent output if not controlled
- Best for: Creative writing, dialogue, open-ended generation
**Temperature Parameter:**
```python
# Low temperature (0.1-0.7): More focused, less random
outputs = model.generate(**inputs, do_sample=True, temperature=0.5)
# Medium temperature (0.7-1.0): Balanced
outputs = model.generate(**inputs, do_sample=True, temperature=0.8)
# High temperature (1.0-2.0): More random, more creative
outputs = model.generate(**inputs, do_sample=True, temperature=1.5)
```
- `temperature → 0`: Approaches greedy decoding
- `temperature = 1.0`: Sample from original distribution
- `temperature > 1.0`: Flatter distribution, more random
- `temperature < 1.0`: Sharper distribution, more confident
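The effect is easy to see directly. Here is a small illustration (plain PyTorch on toy logits, not tied to any model) of how dividing the logits by the temperature reshapes the softmax distribution:
```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # toy next-token logits

for temperature in [0.5, 1.0, 1.5]:
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {probs.tolist()}")

# Lower T sharpens the distribution toward the top token;
# higher T flattens it, giving unlikely tokens more probability mass.
```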
## Advanced Sampling Methods
### Top-k Sampling
Sample from only the k most likely tokens.
```python
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=50,
    top_k=50,          # Consider only the top 50 tokens
    temperature=0.8,
)
```
**How it works:**
1. Filter to top-k most probable tokens
2. Renormalize probabilities
3. Sample from filtered distribution
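For intuition, here is a minimal sketch of that filter on toy logits (illustration only; `generate()` with `top_k` performs this internally):
```python
import torch

logits = torch.randn(32000)   # toy logits over a vocabulary
k = 50

# 1. Filter to the top-k most probable tokens
topk_values, topk_indices = torch.topk(logits, k)

# 2. Renormalize probabilities over the filtered set
probs = torch.softmax(topk_values, dim=-1)

# 3. Sample from the filtered distribution
next_token = topk_indices[torch.multinomial(probs, num_samples=1)]
```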
**Choosing k:**
- `k=1`: Equivalent to greedy decoding
- `k=10-50`: More focused, coherent output
- `k=100-500`: More diverse output
- Too high k: Includes low-probability tokens (noise)
- Too low k: Less diverse, may miss good alternatives
### Top-p (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability ≥ p.
```python
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=50,
    top_p=0.95,        # Nucleus probability
    temperature=0.8,
)
```
**How it works:**
1. Sort tokens by probability
2. Find smallest set with cumulative probability ≥ p
3. Sample from this set
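A simplified sketch of the same steps on toy logits (the library's implementation differs in details such as how the token that crosses the threshold is handled):
```python
import torch

logits = torch.randn(32000)   # toy logits over a vocabulary
p = 0.95

# 1. Sort tokens by probability
probs = torch.softmax(logits, dim=-1)
sorted_probs, sorted_indices = torch.sort(probs, descending=True)

# 2. Keep the head of the distribution up to cumulative probability p
cumulative = torch.cumsum(sorted_probs, dim=-1)
keep = cumulative <= p
keep[0] = True   # always keep the most likely token

# 3. Sample from this set
kept_probs = sorted_probs[keep] / sorted_probs[keep].sum()
next_token = sorted_indices[keep][torch.multinomial(kept_probs, num_samples=1)]
```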
**Choosing p:**
- `p=0.9-0.95`: Good balance (recommended)
- `p=1.0`: Sample from full distribution
- Higher p: More diverse, might include unlikely tokens
- Lower p: More focused, like top-k with adaptive k
**Top-p vs Top-k:**
- Top-p adapts to probability distribution shape
- Top-k is fixed regardless of distribution
- Top-p generally better for variable-quality contexts
- Can combine: `top_k=50, top_p=0.95` (apply both filters)
### Combining Strategies
Temperature, top-k, and top-p can be used together; the filters are applied in sequence before sampling.
```python
# Recommended for high-quality open-ended generation
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=50,
    temperature=0.7,   # Moderate randomness
    top_k=50,          # Keep only the 50 most likely tokens
    top_p=0.9,         # Then apply the nucleus filter
)
```
## GenerationConfig
Configure generation parameters with `GenerationConfig` for reusability.
```python
from transformers import GenerationConfig
generation_config = GenerationConfig(
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,          # Moderate temperature
    top_k=50,                 # Limit to top 50 tokens
    top_p=0.95,               # Nucleus sampling
    repetition_penalty=1.2,   # Discourage repetition
    no_repeat_ngram_size=3,   # Block 3-gram repetition
)
# Use with model
outputs = model.generate(**inputs, generation_config=generation_config)
# Save and load
generation_config.save_pretrained("./config")
loaded_config = GenerationConfig.from_pretrained("./config")
```
## Key Parameters Reference
### Output Length Control
- `max_length`: Maximum total tokens (input + output)
- `max_new_tokens`: Maximum new tokens to generate (recommended over max_length)
- `min_length`: Minimum total tokens
- `min_new_tokens`: Minimum new tokens to generate
### Sampling Parameters
- `temperature`: Sampling temperature (0.1-2.0, default 1.0)
- `top_k`: Top-k sampling (1-100, typically 50)
- `top_p`: Nucleus sampling (0.0-1.0, typically 0.9)
- `do_sample`: Enable sampling (True/False)
### Beam Search Parameters
- `num_beams`: Number of beams (1-20, typically 5)
- `early_stopping`: Stop when beams finish (True/False)
- `length_penalty`: Length penalty (>1.0 favors longer, <1.0 favors shorter)
- `num_beam_groups`: Diverse beam search groups
- `diversity_penalty`: Penalty for similar beams
### Repetition Control
Prevent models from repeating themselves:
- `repetition_penalty`: Penalty for repeating tokens (1.0-2.0, default 1.0)
- `no_repeat_ngram_size`: Prevent repeating n-grams (2-5 typical)
- `encoder_repetition_penalty`: Penalty for repeating encoder tokens
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    # Method 1: Repetition penalty
    repetition_penalty=1.2,          # Penalize repeated tokens (>1.0)
    # Method 2: Block n-gram repetition
    no_repeat_ngram_size=3,          # Never repeat 3-grams
    # Method 3: Encoder repetition penalty (for seq2seq)
    encoder_repetition_penalty=1.0,  # Penalize input tokens
)
```
**Repetition Penalty Values:**
- `1.0`: No penalty
- `1.0-1.5`: Mild penalty (recommended: 1.1-1.3)
- `>1.5`: Strong penalty (may harm coherence)
### Special Tokens
- `bos_token_id`: Beginning of sequence token
- `eos_token_id`: End of sequence token (or list of tokens)
- `pad_token_id`: Padding token
- `forced_bos_token_id`: Force specific token at beginning
- `forced_eos_token_id`: Force specific token at end
### Multiple Sequences
- `num_return_sequences`: Number of sequences to return
- `num_beam_groups`: Number of diverse beam groups
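A short sketch of how these token and sequence controls plug into `generate()` (parameter values are illustrative):
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    num_return_sequences=3,                # return several sampled candidates
    eos_token_id=tokenizer.eos_token_id,   # stop when the model emits EOS
    pad_token_id=tokenizer.pad_token_id,   # needed for batched / open-ended generation
)
for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))
```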
## Controlling Generation Quality
### Length Control
```python
outputs = model.generate(
**inputs,
# Hard constraints
min_length=20, # Minimum total length
max_length=100, # Maximum total length
max_new_tokens=50, # Maximum new tokens (excluding input)
# Soft constraints (with beam search)
length_penalty=1.0, # Encourage longer/shorter outputs
# Early stopping
early_stopping=True, # Stop when condition met
)
```
### Bad Words and Forced Tokens
```python
# Prevent specific tokens
bad_words_ids = [
    tokenizer.encode("badword1", add_special_tokens=False),
    tokenizer.encode("badword2", add_special_tokens=False),
]
outputs = model.generate(
    **inputs,
    bad_words_ids=bad_words_ids,
)

# Force specific tokens (requires beam search)
force_words_ids = [
    tokenizer.encode("important", add_special_tokens=False),
]
outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,
    num_beams=5,   # force_words_ids only works with num_beams > 1
)
```
## Streaming Generation
Generate and process tokens as they're produced:
```python
from transformers import TextStreamer, TextIteratorStreamer
from threading import Thread
# Simple streaming (prints to stdout)
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(**inputs, streamer=streamer, max_new_tokens=100)
# Iterator streaming (for custom processing)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=100)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
```
## Advanced Techniques
### Contrastive Search
Balance coherence and diversity using contrastive objective:
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
penalty_alpha=0.6, # Contrastive penalty
top_k=4, # Consider top-4 tokens
)
```
**When to use:**
- Open-ended text generation
- Reduces repetition without sacrificing coherence
- Good alternative to sampling
### Diverse Beam Search
Generate multiple diverse outputs:
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=10,
num_beam_groups=5, # 5 groups of 2 beams each
diversity_penalty=1.0, # Penalty for similar beams
num_return_sequences=5, # Return 5 diverse outputs
)
```
### Constrained Beam Search
Force the output to include specific tokens or phrases:
```python
from transformers import PhrasalConstraint
constraints = [
PhrasalConstraint(
tokenizer("machine learning", add_special_tokens=False).input_ids
),
PhrasalConstraint(tokenizer("New York", add_special_tokens=False).input_ids)
]
outputs = model.generate(
**inputs,
constraints=constraints,
num_beams=5, # Constrained decoding requires beam search
)
```
### Speculative Decoding
Accelerate generation with a smaller draft (assistant) model that proposes tokens for the main model to verify:
```python
from transformers import AutoModelForCausalLM

# Load main and assistant models
model = AutoModelForCausalLM.from_pretrained("gpt2-large")
assistant_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Generate with assisted (speculative) decoding
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    max_new_tokens=50,
)
```
**Benefits:**
- 2-3x faster generation
- Identical output distribution to regular generation
- Works with sampling and greedy decoding
### Logit Processors
Customize token selection with custom logit processors.
```python
from transformers import LogitsProcessor, LogitsProcessorList

class CustomLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids, scores):
        # Modify scores here
        return scores

logits_processor = LogitsProcessorList([CustomLogitsProcessor()])
outputs = model.generate(
    **inputs,
    logits_processor=logits_processor,
)
```
### Stopping Criteria
Define custom stopping conditions.
```python
from transformers import StoppingCriteria, StoppingCriteriaList

class CustomStoppingCriteria(StoppingCriteria):
    def __call__(self, input_ids, scores, **kwargs):
        # Return True to stop generation
        return False

stopping_criteria = StoppingCriteriaList([CustomStoppingCriteria()])
outputs = model.generate(
    **inputs,
    stopping_criteria=stopping_criteria,
)
```
## Recipe: Recommended Settings by Task
### Creative Writing / Dialogue
```python
outputs = model.generate(
**inputs,
do_sample=True,
max_new_tokens=200,
temperature=0.9,
top_p=0.95,
top_k=50,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
)
```
### Translation / Summarization
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    num_beams=5,
    early_stopping=True,
    length_penalty=1.0,
    no_repeat_ngram_size=2,
)
```
### Code Generation
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.2,   # Low temperature for correctness
    top_p=0.95,
)
```
### Chatbot / Instruction Following
```python
outputs = model.generate(
**inputs,
do_sample=True,
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.1,
)
```
### Factual QA / Information Extraction
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=3,
early_stopping=True,
# Or greedy for very short answers:
# (no special parameters needed)
)
```
## Vision-Language Model Generation
For models like LLaVA, BLIP-2, etc.:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

image = Image.open("image.jpg")
inputs = processor(text="Describe this image", images=image, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
generated_text = processor.decode(outputs[0], skip_special_tokens=True)
```
## Debugging Generation
### Check Token Probabilities
```python
outputs = model.generate(
**inputs,
max_new_tokens=20,
output_scores=True, # Return generation scores
return_dict_in_generate=True, # Return as dict
)
# Access generation scores
scores = outputs.scores # Tuple of tensors (seq_len, vocab_size)
# Get token probabilities
import torch
probs = torch.softmax(scores[0], dim=-1)
```
### Monitor Generation Process
```python
from transformers import LogitsProcessor, LogitsProcessorList

class DebugLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids, scores):
        # Print the top 5 candidate tokens at each step
        top_tokens = scores[0].topk(5)
        print(f"Top 5 tokens: {top_tokens}")
        return scores

outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    logits_processor=LogitsProcessorList([DebugLogitsProcessor()]),
)
```
## Common Issues and Solutions
**Issue: Repetitive output**
- Solution: Increase `repetition_penalty` (1.2-1.5), set `no_repeat_ngram_size=3`
- For sampling: Increase `temperature`, enable `top_p`
**Issue: Incoherent output**
- Solution: Lower `temperature` (0.5-0.8), use beam search
- Set `top_k=50` or `top_p=0.9` to filter unlikely tokens
**Issue: Too short output**
- Solution: Increase `min_length`, set `length_penalty > 1.0` (beam search)
- Check if EOS token is being generated early
**Issue: Too slow generation**
- Solution: Use greedy instead of beam search
- Reduce `num_beams`
- Try speculative decoding with assistant model
- Use smaller model variant
**Issue: Output doesn't follow format**
- Solution: Use constrained beam search
- Add format examples to prompt
- Use `bad_words_ids` to prevent format-breaking tokens
## Performance Optimization
### Use KV Cache
```python
# KV cache is enabled by default and reuses past key/value states between steps
outputs = model.generate(**inputs, use_cache=True)

# Static cache for longer sequences (if supported by the model)
outputs = model.generate(**inputs, cache_implementation="static")
```
### Mixed Precision
```python
import torch

# Load weights in half precision
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Or run generation under autocast
with torch.cuda.amp.autocast():
    outputs = model.generate(**inputs, max_new_tokens=100)
```
### Batch Generation
```python
texts = ["Prompt 1", "Prompt 2", "Prompt 3"]
inputs = tokenizer(texts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=50)
```
### Quantization
```python
from transformers import BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
```


@@ -0,0 +1,234 @@
# Transformers Pipelines
Pipelines provide a simple and optimized interface for inference across many machine learning tasks. They abstract away the complexity of tokenization, model invocation, and post-processing.
## Usage Pattern
```python
from transformers import pipeline
# Basic usage
classifier = pipeline("text-classification")
result = classifier("This movie was amazing!")
# With specific model
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("This movie was amazing!")
```
## Natural Language Processing Pipelines
### Text Classification
```python
classifier = pipeline("text-classification")
classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
```
### Zero-Shot Classification
```python
classifier = pipeline("zero-shot-classification")
classifier("This is about climate change", candidate_labels=["politics", "science", "sports"])
```
### Token Classification (NER)
```python
ner = pipeline("token-classification")
ner("My name is Sarah and I work at Microsoft in Seattle")
```
### Question Answering
```python
qa = pipeline("question-answering")
qa(question="What is the capital?", context="The capital of France is Paris.")
```
### Text Generation
```python
generator = pipeline("text-generation")
generator("Once upon a time", max_length=50)
```
### Text2Text Generation
```python
generator = pipeline("text2text-generation", model="t5-base")
generator("translate English to French: Hello")
```
### Summarization
```python
summarizer = pipeline("summarization")
summarizer("Long article text here...", max_length=130, min_length=30)
```
### Translation
```python
translator = pipeline("translation_en_to_fr")
translator("Hello, how are you?")
```
### Fill Mask
```python
unmasker = pipeline("fill-mask")
unmasker("Paris is the [MASK] of France.")
```
### Feature Extraction
```python
extractor = pipeline("feature-extraction")
embeddings = extractor("This is a sentence")
```
### Document Question Answering
```python
doc_qa = pipeline("document-question-answering")
doc_qa(image="document.png", question="What is the invoice number?")
```
### Table Question Answering
```python
table_qa = pipeline("table-question-answering")
table_qa(table=data, query="How many employees?")
```
## Computer Vision Pipelines
### Image Classification
```python
classifier = pipeline("image-classification")
classifier("cat.jpg")
```
### Zero-Shot Image Classification
```python
classifier = pipeline("zero-shot-image-classification")
classifier("cat.jpg", candidate_labels=["cat", "dog", "bird"])
```
### Object Detection
```python
detector = pipeline("object-detection")
detector("street.jpg")
```
### Image Segmentation
```python
segmenter = pipeline("image-segmentation")
segmenter("image.jpg")
```
### Image-to-Image
```python
img2img = pipeline("image-to-image", model="lllyasviel/sd-controlnet-canny")
img2img("input.jpg")
```
### Depth Estimation
```python
depth = pipeline("depth-estimation")
depth("image.jpg")
```
### Video Classification
```python
classifier = pipeline("video-classification")
classifier("video.mp4")
```
### Keypoint Matching
```python
matcher = pipeline("keypoint-matching")
matcher(image1="img1.jpg", image2="img2.jpg")
```
## Audio Pipelines
### Automatic Speech Recognition
```python
asr = pipeline("automatic-speech-recognition")
asr("audio.wav")
```
### Audio Classification
```python
classifier = pipeline("audio-classification")
classifier("audio.wav")
```
### Zero-Shot Audio Classification
```python
classifier = pipeline("zero-shot-audio-classification")
classifier("audio.wav", candidate_labels=["speech", "music", "noise"])
```
### Text-to-Audio/Text-to-Speech
```python
synthesizer = pipeline("text-to-audio")
audio = synthesizer("Hello, how are you today?")
```
## Multimodal Pipelines
### Image-to-Text (Image Captioning)
```python
captioner = pipeline("image-to-text")
captioner("image.jpg")
```
### Visual Question Answering
```python
vqa = pipeline("visual-question-answering")
vqa(image="image.jpg", question="What color is the car?")
```
### Image-Text-to-Text (VLMs)
```python
vlm = pipeline("image-text-to-text")
vlm(images="image.jpg", text="Describe this image in detail")
```
### Zero-Shot Object Detection
```python
detector = pipeline("zero-shot-object-detection")
detector("image.jpg", candidate_labels=["car", "person", "tree"])
```
## Pipeline Configuration
### Common Parameters
- `model`: Specify model identifier or path
- `device`: Set device (0 for GPU, -1 for CPU, or "cuda:0")
- `batch_size`: Process multiple inputs at once
- `torch_dtype`: Set precision (torch.float16, torch.bfloat16)
```python
import torch

# GPU with half precision
pipe = pipeline("text-generation", model="gpt2", device=0, torch_dtype=torch.float16)
# Batch processing
pipe(["text 1", "text 2", "text 3"], batch_size=8)
```
### Task-Specific Parameters
Each pipeline accepts task-specific parameters in the call:
```python
# Text generation
generator("prompt", max_length=100, temperature=0.7, top_p=0.9, num_return_sequences=3)
# Summarization
summarizer("text", max_length=130, min_length=30, do_sample=False)
# Translation
translator("text", max_length=512, num_beams=4)
```
## Best Practices
1. **Reuse pipelines**: Create once, use multiple times for efficiency
2. **Batch processing**: Use batches for multiple inputs to maximize throughput
3. **GPU acceleration**: Set `device=0` for GPU when available
4. **Model selection**: Choose task-specific models for best results
5. **Memory management**: Use `torch_dtype=torch.float16` for large models
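A short sketch putting these practices together (the model name and device index are only examples):
```python
import torch
from transformers import pipeline

# Create once, reuse for every request
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,                    # GPU if available
    torch_dtype=torch.float16,   # half precision for large models
)

# Batch multiple inputs to maximize throughput
texts = ["Great product!", "Terrible support.", "Okay overall."]
results = classifier(texts, batch_size=8)
```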


@@ -1,504 +0,0 @@
# Model Quantization Guide
Comprehensive guide to reducing model memory footprint through quantization while maintaining accuracy.
## Overview
Quantization reduces memory requirements by storing model weights in lower precision formats (int8, int4) instead of full precision (float32). This enables:
- Running larger models on limited hardware
- Faster inference (reduced memory bandwidth)
- Lower deployment costs
- Enabling fine-tuning of models that wouldn't fit in memory
**Tradeoffs:**
- Slight accuracy loss (typically < 1-2%)
- Initial quantization overhead
- Some methods require calibration data
## Quick Comparison
| Method | Precision | Speed | Accuracy | Fine-tuning | Hardware | Setup |
|--------|-----------|-------|----------|-------------|----------|-------|
| **Bitsandbytes** | 4/8-bit | Fast | High | Yes (PEFT) | CUDA, CPU | Easy |
| **GPTQ** | 2-8-bit | Very Fast | High | Limited | CUDA, ROCm, Metal | Medium |
| **AWQ** | 4-bit | Very Fast | High | Yes (PEFT) | CUDA, ROCm | Medium |
| **GGUF** | 1-8-bit | Medium | Variable | No | CPU-optimized | Easy |
| **HQQ** | 1-8-bit | Fast | High | Yes | Multi-platform | Medium |
## Bitsandbytes (BnB)
On-the-fly quantization with excellent PEFT fine-tuning support.
### 8-bit Quantization
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
load_in_8bit=True, # Enable 8-bit quantization
device_map="auto", # Automatic device placement
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Use normally
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
```
**Memory Savings:**
- 7B model: ~14GB → ~7GB (50% reduction)
- 13B model: ~26GB → ~13GB
- 70B model: ~140GB → ~70GB
**Characteristics:**
- Fast inference
- Minimal accuracy loss
- Works with PEFT (LoRA, QLoRA)
- Supports CPU and CUDA GPUs
### 4-bit Quantization (QLoRA)
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Enable 4-bit quantization
bnb_4bit_quant_type="nf4", # Quantization type ("nf4" or "fp4")
bnb_4bit_compute_dtype=torch.float16, # Computation dtype
bnb_4bit_use_double_quant=True, # Nested quantization for more savings
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)
```
**Memory Savings:**
- 7B model: ~14GB → ~4GB (70% reduction)
- 13B model: ~26GB → ~7GB
- 70B model: ~140GB → ~35GB
**Quantization Types:**
- `nf4`: Normal Float 4 (recommended, better quality)
- `fp4`: Float Point 4 (slightly more memory efficient)
**Compute Dtype:**
```python
# For better quality
bnb_4bit_compute_dtype=torch.float16
# For best performance on Ampere+ GPUs
bnb_4bit_compute_dtype=torch.bfloat16
```
**Double Quantization:**
```python
# Enable for additional ~0.4 bits/param savings
bnb_4bit_use_double_quant=True # Quantize the quantization constants
```
### Fine-tuning with QLoRA
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# Load quantized model
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)
# Prepare for training
model = prepare_model_for_kbit_training(model)
# Configure LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Train normally
trainer = Trainer(model=model, args=training_args, ...)
trainer.train()
```
## GPTQ
Post-training quantization requiring calibration, optimized for inference speed.
### Loading GPTQ Models
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
# Load pre-quantized GPTQ model
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GPTQ", # Pre-quantized model
device_map="auto",
revision="gptq-4bit-32g-actorder_True", # Specific quantization config
)
# Or quantize yourself
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
gptq_config = GPTQConfig(
bits=4, # 2, 3, 4, 8 bits
dataset="c4", # Calibration dataset
tokenizer=tokenizer,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
device_map="auto",
quantization_config=gptq_config,
)
# Save quantized model
model.save_pretrained("llama-2-7b-gptq")
```
**Configuration Options:**
```python
gptq_config = GPTQConfig(
bits=4, # Quantization bits
group_size=128, # Group size for quantization (128, 32, -1)
dataset="c4", # Calibration dataset
desc_act=False, # Activation order (can improve accuracy)
sym=True, # Symmetric quantization
damp_percent=0.1, # Dampening factor
)
```
**Characteristics:**
- Fastest inference among quantization methods
- Requires one-time calibration (slow)
- Best when using pre-quantized models from Hub
- Limited fine-tuning support
- Excellent for production deployment
## AWQ (Activation-aware Weight Quantization)
Protects important weights for better quality.
### Loading AWQ Models
```python
from transformers import AutoModelForCausalLM, AwqConfig
# Load pre-quantized AWQ model
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-AWQ",
device_map="auto",
)
# Or quantize yourself
awq_config = AwqConfig(
bits=4, # 4-bit quantization
group_size=128, # Quantization group size
zero_point=True, # Use zero-point quantization
version="GEMM", # Quantization version
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=awq_config,
device_map="auto",
)
```
**Characteristics:**
- Better accuracy than GPTQ at same bit width
- Excellent inference speed
- Supports PEFT fine-tuning
- Requires calibration data
### Fine-tuning AWQ Models
```python
from peft import LoraConfig, get_peft_model
# AWQ models support LoRA fine-tuning
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
trainer = Trainer(model=model, ...)
trainer.train()
```
## GGUF (GGML Format)
CPU-optimized quantization format, popular in llama.cpp ecosystem.
### Using GGUF Models
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load GGUF model
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GGUF",
gguf_file="llama-2-7b.Q4_K_M.gguf", # Specific quantization file
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GGUF")
```
**GGUF Quantization Types:**
- `Q4_0`: 4-bit, smallest, lowest quality
- `Q4_K_M`: 4-bit, medium quality (recommended)
- `Q5_K_M`: 5-bit, good quality
- `Q6_K`: 6-bit, high quality
- `Q8_0`: 8-bit, very high quality
**Characteristics:**
- Optimized for CPU inference
- Wide range of bit depths (1-8)
- Good for Apple Silicon (M1/M2)
- No fine-tuning support
- Excellent for local/edge deployment
## HQQ (Half-Quadratic Quantization)
Flexible quantization with good accuracy retention.
### Using HQQ
```python
from transformers import AutoModelForCausalLM, HqqConfig
hqq_config = HqqConfig(
nbits=4, # Quantization bits
group_size=64, # Group size
quant_zero=False, # Quantize zero point
quant_scale=False, # Quantize scale
axis=0, # Quantization axis
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=hqq_config,
device_map="auto",
)
```
**Characteristics:**
- Very fast quantization
- No calibration data needed
- Support for 1-8 bits
- Can serialize/deserialize
- Good accuracy vs size tradeoff
## Choosing a Quantization Method
### Decision Tree
**For inference only:**
1. Need fastest inference? → **GPTQ or AWQ** (use pre-quantized models)
2. CPU-only deployment? → **GGUF**
3. Want easiest setup? → **Bitsandbytes 8-bit**
4. Need extreme compression? → **GGUF Q4_0 or HQQ 2-bit**
**For fine-tuning:**
1. Limited VRAM? → **QLoRA (BnB 4-bit + LoRA)**
2. Want best accuracy? → **Bitsandbytes 8-bit + LoRA**
3. Need very large models? → **QLoRA with double quantization**
**For production:**
1. Latency-critical? → **GPTQ or AWQ**
2. Cost-optimized? → **Bitsandbytes 8-bit**
3. CPU deployment? → **GGUF**
## Memory Requirements
Approximate memory for Llama-2 7B model:
| Method | Memory | vs FP16 |
|--------|--------|---------|
| FP32 | 28GB | 2x |
| FP16 / BF16 | 14GB | 1x |
| 8-bit (BnB) | 7GB | 0.5x |
| 4-bit (QLoRA) | 3.5GB | 0.25x |
| 4-bit Double Quant | 3GB | 0.21x |
| GPTQ 4-bit | 4GB | 0.29x |
| AWQ 4-bit | 4GB | 0.29x |
**Note:** Add ~1-2GB for inference activations, KV cache, and framework overhead.
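These figures follow from a simple weights-only estimate (parameters × bits per weight); a rough sketch, ignoring activations, KV cache, and quantization constants:
```python
def approx_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory estimate in GB."""
    return num_params * bits_per_param / 8 / 1e9

# Llama-2 7B has roughly 7e9 parameters
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(name, round(approx_weight_memory_gb(7e9, bits), 1), "GB")
# fp16 ~14 GB, int8 ~7 GB, int4 ~3.5 GB
```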
## Best Practices
### For Training
```python
# QLoRA recommended configuration
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16, # BF16 if available
bnb_4bit_use_double_quant=True,
)
# LoRA configuration
lora_config = LoraConfig(
r=16, # Rank (8, 16, 32, 64)
lora_alpha=32, # Scaling (typically 2*r)
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
```
### For Inference
```python
# High-speed inference
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GPTQ",
device_map="auto",
torch_dtype=torch.float16, # Use FP16 for activations
)
# Balanced quality/speed
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
load_in_8bit=True,
device_map="auto",
)
# Maximum compression
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
),
device_map="auto",
)
```
### Multi-GPU Setups
```python
# Automatically distribute across GPUs
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
load_in_4bit=True,
device_map="auto", # Automatic distribution
max_memory={0: "20GB", 1: "20GB"}, # Optional: limit per GPU
)
# Manual device map
device_map = {
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
# ... distribute layers ...
"model.norm": 1,
"lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
load_in_4bit=True,
device_map=device_map,
)
```
## Troubleshooting
**Issue: OOM during quantization**
```python
# Solution: Use low_cpu_mem_usage
model = AutoModelForCausalLM.from_pretrained(
"model-name",
quantization_config=config,
device_map="auto",
low_cpu_mem_usage=True, # Reduce CPU memory during loading
)
```
**Issue: Slow quantization**
```python
# GPTQ/AWQ take time to calibrate
# Solution: Use pre-quantized models from Hub
model = AutoModelForCausalLM.from_pretrained("TheBloke/Model-GPTQ")
# Or use BnB for instant quantization
model = AutoModelForCausalLM.from_pretrained("model-name", load_in_4bit=True)
```
**Issue: Poor quality after quantization**
```python
# Try different quantization types
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # Try "nf4" instead of "fp4"
bnb_4bit_compute_dtype=torch.bfloat16, # Use BF16 if available
)
# Or use 8-bit instead of 4-bit
model = AutoModelForCausalLM.from_pretrained("model-name", load_in_8bit=True)
```
**Issue: Can't fine-tune quantized model**
```python
# Ensure using compatible quantization method
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
# Only BnB and AWQ support PEFT fine-tuning
# GPTQ has limited support, GGUF doesn't support fine-tuning
```
## Performance Benchmarks
Approximate generation speed (tokens/sec) for Llama-2 7B on A100 40GB:
| Method | Speed | Memory |
|--------|-------|--------|
| FP16 | 100 tok/s | 14GB |
| 8-bit | 90 tok/s | 7GB |
| 4-bit QLoRA | 70 tok/s | 4GB |
| GPTQ 4-bit | 95 tok/s | 4GB |
| AWQ 4-bit | 95 tok/s | 4GB |
**Note:** Actual performance varies by hardware, sequence length, and batch size.
## Resources
- **Pre-quantized models:** Search "GPTQ" or "AWQ" on Hugging Face Hub
- **BnB documentation:** https://github.com/TimDettmers/bitsandbytes
- **PEFT library:** https://github.com/huggingface/peft
- **QLoRA paper:** https://arxiv.org/abs/2305.14314
For task-specific quantization examples, see `training_guide.md`.

File diff suppressed because it is too large


@@ -0,0 +1,328 @@
# Training with Transformers
Transformers provides comprehensive training capabilities through the `Trainer` API, supporting distributed training, mixed precision, and advanced optimization techniques.
## Basic Training Workflow
```python
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments
)
from datasets import load_dataset
# 1. Load and preprocess data
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2
)
# 3. Define training arguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
learning_rate=2e-5,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
# 4. Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
)
# 5. Train
trainer.train()
# 6. Evaluate
trainer.evaluate()
# 7. Save model
trainer.save_model("./final_model")
```
## TrainingArguments Configuration
### Essential Parameters
**Output and Logging:**
- `output_dir`: Directory for checkpoints and outputs (required)
- `logging_dir`: TensorBoard log directory (default: `{output_dir}/runs`)
- `logging_steps`: Log every N steps (default: 500)
- `logging_strategy`: "steps" or "epoch"
**Training Duration:**
- `num_train_epochs`: Number of epochs (default: 3.0)
- `max_steps`: Max training steps (overrides num_train_epochs if set)
**Batch Size and Gradient Accumulation:**
- `per_device_train_batch_size`: Batch size per device (default: 8)
- `per_device_eval_batch_size`: Eval batch size per device (default: 8)
- `gradient_accumulation_steps`: Accumulate gradients over N steps (default: 1)
- Effective batch size = `per_device_train_batch_size * gradient_accumulation_steps * num_gpus`
**Learning Rate:**
- `learning_rate`: Peak learning rate (default: 5e-5)
- `lr_scheduler_type`: Scheduler type ("linear", "cosine", "constant", etc.)
- `warmup_steps`: Warmup steps (default: 0)
- `warmup_ratio`: Warmup as fraction of total steps
**Evaluation:**
- `eval_strategy`: "no", "steps", or "epoch" (default: "no")
- `eval_steps`: Evaluate every N steps (if eval_strategy="steps")
- `eval_delay`: Delay evaluation until N steps
**Checkpointing:**
- `save_strategy`: "no", "steps", or "epoch" (default: "steps")
- `save_steps`: Save checkpoint every N steps (default: 500)
- `save_total_limit`: Keep only N most recent checkpoints
- `load_best_model_at_end`: Load best checkpoint at end (default: False)
- `metric_for_best_model`: Metric to determine best model
**Optimization:**
- `optim`: Optimizer ("adamw_torch", "adamw_hf", "sgd", etc.)
- `weight_decay`: Weight decay coefficient (default: 0.0)
- `adam_beta1`, `adam_beta2`: Adam optimizer betas
- `adam_epsilon`: Epsilon for Adam (default: 1e-8)
- `max_grad_norm`: Max gradient norm for clipping (default: 1.0)
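A sketch that pulls several of these options into one configuration (values are illustrative, not recommendations for any particular dataset):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,    # effective batch size = 16 * 2 * num_gpus
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    weight_decay=0.01,
    max_grad_norm=1.0,
    logging_steps=100,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```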
### Mixed Precision Training
```python
training_args = TrainingArguments(
output_dir="./results",
fp16=True, # Use fp16 on NVIDIA GPUs
fp16_opt_level="O1", # O0, O1, O2, O3 (Apex levels)
# or
bf16=True, # Use bf16 on Ampere+ GPUs (better than fp16)
)
```
### Distributed Training
**DataParallel (single-node multi-GPU):**
```python
# Automatic with multiple GPUs
training_args = TrainingArguments(
output_dir="./results",
per_device_train_batch_size=16, # Per GPU
)
# Run: python script.py
```
**DistributedDataParallel (multi-node or multi-GPU):**
```bash
# Single node, multiple GPUs
python -m torch.distributed.launch --nproc_per_node=4 script.py
# Or use accelerate
accelerate config
accelerate launch script.py
```
**DeepSpeed Integration:**
```python
training_args = TrainingArguments(
output_dir="./results",
deepspeed="ds_config.json", # DeepSpeed config file
)
```
### Advanced Features
**Gradient Checkpointing (reduce memory):**
```python
training_args = TrainingArguments(
output_dir="./results",
gradient_checkpointing=True,
)
```
**Compilation with torch.compile:**
```python
training_args = TrainingArguments(
output_dir="./results",
torch_compile=True,
torch_compile_backend="inductor", # or "cudagraphs"
)
```
**Push to Hub:**
```python
training_args = TrainingArguments(
output_dir="./results",
push_to_hub=True,
hub_model_id="username/model-name",
hub_strategy="every_save", # or "end"
)
```
## Custom Training Components
### Custom Metrics
```python
import evaluate
import numpy as np
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
trainer = Trainer(
model=model,
args=training_args,
compute_metrics=compute_metrics,
)
```
### Custom Loss Function
```python
import torch

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Custom loss calculation (class_weights is a user-defined tensor)
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```
### Data Collator
```python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=data_collator,
)
```
### Callbacks
```python
from transformers import TrainerCallback
class CustomCallback(TrainerCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        print(f"Epoch {state.epoch} completed!")
        return control
trainer = Trainer(
model=model,
args=training_args,
callbacks=[CustomCallback],
)
```
## Hyperparameter Search
```python
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=2
    )
trainer = Trainer(
model_init=model_init,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
)
# Optuna-based search
best_trial = trainer.hyperparameter_search(
direction="maximize",
backend="optuna",
n_trials=10,
hp_space=lambda trial: {
"learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
"per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
"num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
}
)
```
## Training Best Practices
1. **Start with small learning rates**: 2e-5 to 5e-5 for fine-tuning
2. **Use warmup**: 5-10% of total steps for learning rate warmup
3. **Monitor training**: Use eval_strategy="epoch" or "steps" to track progress
4. **Save checkpoints**: Set save_strategy and save_total_limit
5. **Use mixed precision**: Enable fp16 or bf16 for faster training
6. **Gradient accumulation**: For large effective batch sizes on limited memory
7. **Load best model**: Set load_best_model_at_end=True to avoid overfitting
8. **Push to Hub**: Enable push_to_hub for easy model sharing and versioning
## Common Training Patterns
### Classification
```python
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=num_classes,
id2label=id2label,
label2id=label2id
)
```
### Question Answering
```python
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
```
### Token Classification (NER)
```python
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=num_tags,
id2label=id2label,
label2id=label2id
)
```
### Sequence-to-Sequence
```python
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
```
### Causal Language Modeling
```python
model = AutoModelForCausalLM.from_pretrained("gpt2")
```
### Masked Language Modeling
```python
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
```