Update Hugging Face Transformers

This commit is contained in:
Timothy Kassis
2025-10-21 10:30:38 -07:00
parent 1a9149b089
commit 11da596765
12 changed files with 2328 additions and 3148 deletions

View File

@@ -7,7 +7,7 @@
},
"metadata": {
"description": "Claude scientific skills from K-Dense Inc",
"version": "1.37.0"
"version": "1.38.0"
},
"plugins": [
{

View File

@@ -32,7 +32,7 @@
- **scikit-learn** - Machine learning algorithms, preprocessing, and model selection
- **statsmodels** - Statistical modeling and econometrics (OLS, GLM, logit/probit, ARIMA, time series forecasting, hypothesis testing, diagnostics)
- **Torch Geometric** - Graph Neural Networks for molecular and geometric data
- **Transformers** - Hugging Face transformers for NLU, image classification, and generation
- **Transformers** - State-of-the-art machine learning models for NLP, computer vision, audio, and multimodal tasks. Provides 1M+ pre-trained models accessible via pipelines (text-classification, NER, QA, summarization, translation, text-generation, image-classification, object-detection, ASR, VQA), comprehensive training via Trainer API with distributed training and mixed precision, flexible text generation with multiple decoding strategies (greedy, beam search, sampling), and Auto classes for automatic architecture selection (BERT, GPT, T5, ViT, BART, etc.)
- **UMAP-learn** - Dimensionality reduction and manifold learning
## Materials Science & Chemistry

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -1,22 +1,8 @@
# Text Generation Strategies
Comprehensive guide to text generation methods in Transformers for controlling output quality, creativity, and diversity.
Transformers provides flexible text generation capabilities through the `generate()` method, supporting multiple decoding strategies and configuration options.
## Overview
Text generation is the process of predicting tokens sequentially using a language model. The choice of generation strategy significantly impacts output quality, diversity, and computational cost.
**When to use each strategy:**
- **Greedy**: Fast, deterministic, good for short outputs or when consistency is critical
- **Beam Search**: Better quality for tasks with clear "correct" answers (translation, summarization)
- **Sampling**: Creative, diverse outputs for open-ended generation (stories, dialogue)
- **Top-k/Top-p**: Balanced creativity and coherence
## Basic Generation Methods
### Greedy Decoding
Selects the highest probability token at each step. Fast but prone to repetition and suboptimal sequences.
## Basic Generation
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -24,507 +10,364 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("The future of AI", return_tensors="pt")
# Greedy decoding (default)
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
generated_text = tokenizer.decode(outputs[0])
```
**Characteristics:**
- Deterministic (always same output for same input)
- Fast (single forward pass per token)
- Prone to repetition in longer sequences
- Best for: Short generations, deterministic applications
## Decoding Strategies
**Parameters:**
```python
outputs = model.generate(
**inputs,
max_new_tokens=50, # Number of tokens to generate
min_length=10, # Minimum total length
pad_token_id=tokenizer.pad_token_id,
)
```
### 1. Greedy Decoding
### Beam Search
Maintains multiple hypotheses (beams) and selects the sequence with highest overall probability.
Selects the token with highest probability at each step. Deterministic but can be repetitive.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5, # Number of beams
early_stopping=True, # Stop when all beams finish
no_repeat_ngram_size=2, # Prevent 2-gram repetition
do_sample=False,
num_beams=1 # Greedy is default when num_beams=1 and do_sample=False
)
```
**Characteristics:**
- Higher quality than greedy for tasks with "correct" answers
- Slower than greedy (num_beams forward passes per step)
- Still can suffer from repetition
- Best for: Translation, summarization, QA generation
### 2. Beam Search
Explores multiple hypotheses simultaneously, keeping top-k candidates at each step.
**Advanced Parameters:**
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5, # Number of beams
early_stopping=True, # Stop when all beams reach EOS
no_repeat_ngram_size=2, # Prevent repeating n-grams
)
```
**Key parameters:**
- `num_beams`: Number of beams (higher = more thorough but slower)
- `early_stopping`: Stop when all beams finish (True/False)
- `length_penalty`: Exponential penalty for length (>1.0 favors longer sequences)
- `no_repeat_ngram_size`: Prevent repeating n-grams
### 3. Sampling (Multinomial)
Samples from probability distribution, introducing randomness and diversity.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True,
temperature=0.7, # Controls randomness (lower = more focused)
top_k=50, # Consider only top-k tokens
top_p=0.9, # Nucleus sampling (cumulative probability threshold)
)
```
**Key parameters:**
- `temperature`: Scales logits before softmax (0.1-2.0 typical range)
- Lower (0.1-0.7): More focused, deterministic
- Higher (0.8-1.5): More creative, random
- `top_k`: Sample from top-k tokens only
- `top_p`: Nucleus sampling - sample from smallest set with cumulative probability > p
### 4. Beam Search with Sampling
Combines beam search with sampling for diverse but coherent outputs.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
num_beam_groups=1, # Diverse beam search groups
diversity_penalty=0.0, # Penalty for similar beams
length_penalty=1.0, # >1: longer sequences, <1: shorter
early_stopping=True, # Stop when num_beams sequences finish
no_repeat_ngram_size=2, # Block repeating n-grams
num_return_sequences=1, # Number of sequences to return (must be ≤ num_beams)
)
```
**Length Penalty:**
- `length_penalty > 1.0`: Favor longer sequences
- `length_penalty = 1.0`: No penalty
- `length_penalty < 1.0`: Favor shorter sequences
### Sampling (Multinomial)
Randomly sample tokens according to the probability distribution.
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True, # Enable sampling
temperature=1.0, # Sampling temperature
num_beams=1, # Single beam gives pure multinomial sampling (num_beams > 1 with do_sample=True is beam-search sampling)
)
```
**Characteristics:**
- Non-deterministic (different output each time)
- More diverse and creative than greedy/beam search
- Can produce incoherent output if not controlled
- Best for: Creative writing, dialogue, open-ended generation
**Temperature Parameter:**
```python
# Low temperature (0.1-0.7): More focused, less random
outputs = model.generate(**inputs, do_sample=True, temperature=0.5)
# Medium temperature (0.7-1.0): Balanced
outputs = model.generate(**inputs, do_sample=True, temperature=0.8)
# High temperature (1.0-2.0): More random, more creative
outputs = model.generate(**inputs, do_sample=True, temperature=1.5)
```
- `temperature → 0`: Approaches greedy decoding
- `temperature = 1.0`: Sample from original distribution
- `temperature > 1.0`: Flatter distribution, more random
- `temperature < 1.0`: Sharper distribution, more confident
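To see what these temperature values actually do, here is a minimal sketch in plain PyTorch (the logits are made up for illustration):
```python
import torch

# Hypothetical next-token logits over a 5-token vocabulary
logits = torch.tensor([4.0, 3.0, 2.0, 1.0, 0.0])

for temperature in (0.5, 1.0, 1.5):
    # Temperature divides the logits before softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {probs.tolist()}")
```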
## Advanced Sampling Methods
### Top-k Sampling
Sample from only the k most likely tokens.
```python
outputs = model.generate(
**inputs,
do_sample=True,
max_new_tokens=50,
top_k=50, # Consider top 50 tokens
temperature=0.8,
top_k=50,
)
```
**How it works:**
1. Filter to top-k most probable tokens
2. Renormalize probabilities
3. Sample from filtered distribution
### 5. Contrastive Search
**Choosing k:**
- `k=1`: Equivalent to greedy decoding
- `k=10-50`: More focused, coherent output
- `k=100-500`: More diverse output
- Too high k: Includes low-probability tokens (noise)
- Too low k: Less diverse, may miss good alternatives
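The three filtering steps above are easy to reproduce by hand; a minimal sketch in plain PyTorch (illustrative logits, not the library's internal implementation):
```python
import torch

logits = torch.tensor([2.0, 1.5, 0.5, -1.0, -3.0])  # hypothetical next-token logits
k = 3

# 1. Keep only the k most probable tokens
topk_values, topk_indices = torch.topk(logits, k)
filtered = torch.full_like(logits, float("-inf"))
filtered[topk_indices] = topk_values

# 2. Renormalize (the -inf entries get probability 0)
probs = torch.softmax(filtered, dim=-1)

# 3. Sample from the filtered distribution
next_token = torch.multinomial(probs, num_samples=1)
```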
### Top-p (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability ≥ p.
Balances coherence and diversity using a contrastive objective.
```python
outputs = model.generate(
**inputs,
do_sample=True,
max_new_tokens=50,
top_p=0.95, # Nucleus probability
temperature=0.8,
penalty_alpha=0.6, # Contrastive penalty
top_k=4, # Consider top-k candidates
)
```
**How it works:**
1. Sort tokens by probability
2. Find smallest set with cumulative probability ≥ p
3. Sample from this set
### 6. Assisted Decoding
**Choosing p:**
- `p=0.9-0.95`: Good balance (recommended)
- `p=1.0`: Sample from full distribution
- Higher p: More diverse, might include unlikely tokens
- Lower p: More focused, like top-k with adaptive k
**Top-p vs Top-k:**
- Top-p adapts to probability distribution shape
- Top-k is fixed regardless of distribution
- Top-p generally better for variable-quality contexts
- Can combine: `top_k=50, top_p=0.95` (apply both filters)
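The nucleus cut-off works the same way, except the number of kept tokens adapts to the distribution; a sketch under the same illustrative setup:
```python
import torch

logits = torch.tensor([2.0, 1.5, 0.5, -1.0, -3.0])  # hypothetical next-token logits
top_p = 0.9

# 1. Sort tokens by probability
probs = torch.softmax(logits, dim=-1)
sorted_probs, sorted_indices = torch.sort(probs, descending=True)

# 2. Find the smallest set whose cumulative probability >= top_p
cumulative = torch.cumsum(sorted_probs, dim=-1)
keep = cumulative - sorted_probs < top_p  # always keeps at least the top token

# 3. Zero out the rest, renormalize, and sample
nucleus = torch.zeros_like(probs)
nucleus[sorted_indices[keep]] = sorted_probs[keep]
nucleus = nucleus / nucleus.sum()
next_token = torch.multinomial(nucleus, num_samples=1)
```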
### Combining Strategies
Uses a smaller "assistant" model to speed up generation with a larger model.
```python
# Recommended for high-quality open-ended generation
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2-large")
assistant_model = AutoModelForCausalLM.from_pretrained("gpt2")
outputs = model.generate(
**inputs,
do_sample=True,
assistant_model=assistant_model,
max_new_tokens=50,
)
```
## GenerationConfig
Configure generation parameters with `GenerationConfig` for reusability.
```python
from transformers import GenerationConfig
generation_config = GenerationConfig(
max_new_tokens=100,
temperature=0.8, # Moderate temperature
top_k=50, # Limit to top 50 tokens
top_p=0.95, # Nucleus sampling
repetition_penalty=1.2, # Discourage repetition
no_repeat_ngram_size=3, # Block 3-gram repetition
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
)
# Use with model
outputs = model.generate(**inputs, generation_config=generation_config)
# Save and load
generation_config.save_pretrained("./config")
loaded_config = GenerationConfig.from_pretrained("./config")
```
## Controlling Generation Quality
## Key Parameters Reference
### Output Length Control
- `max_length`: Maximum total tokens (input + output)
- `max_new_tokens`: Maximum new tokens to generate (recommended over max_length)
- `min_length`: Minimum total tokens
- `min_new_tokens`: Minimum new tokens to generate
### Sampling Parameters
- `temperature`: Sampling temperature (0.1-2.0, default 1.0)
- `top_k`: Top-k sampling (1-100, typically 50)
- `top_p`: Nucleus sampling (0.0-1.0, typically 0.9)
- `do_sample`: Enable sampling (True/False)
### Beam Search Parameters
- `num_beams`: Number of beams (1-20, typically 5)
- `early_stopping`: Stop when beams finish (True/False)
- `length_penalty`: Length penalty (>1.0 favors longer, <1.0 favors shorter)
- `num_beam_groups`: Diverse beam search groups
- `diversity_penalty`: Penalty for similar beams
### Repetition Control
Prevent models from repeating themselves:
- `repetition_penalty`: Penalty for repeating tokens (1.0-2.0, default 1.0)
- `no_repeat_ngram_size`: Prevent repeating n-grams (2-5 typical)
- `encoder_repetition_penalty`: Penalty for repeating encoder tokens
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    # Method 1: Repetition penalty
    repetition_penalty=1.2,          # Penalize repeated tokens (>1.0)
    # Method 2: Block n-gram repetition
    no_repeat_ngram_size=3,          # Never repeat 3-grams
    # Method 3: Encoder repetition penalty (for seq2seq)
    encoder_repetition_penalty=1.0,  # Penalize input tokens
)
```
**Repetition Penalty Values:**
- `1.0`: No penalty
- `1.0-1.5`: Mild penalty (recommended: 1.1-1.3)
- `>1.5`: Strong penalty (may harm coherence)
### Special Tokens
- `bos_token_id`: Beginning of sequence token
- `eos_token_id`: End of sequence token (or list of tokens)
- `pad_token_id`: Padding token
- `forced_bos_token_id`: Force specific token at beginning
- `forced_eos_token_id`: Force specific token at end
### Multiple Sequences
- `num_return_sequences`: Number of sequences to return
- `num_beam_groups`: Number of diverse beam groups
## Advanced Generation Techniques
### Length Control
### Constrained Generation
```python
outputs = model.generate(
**inputs,
# Hard constraints
min_length=20, # Minimum total length
max_length=100, # Maximum total length
max_new_tokens=50, # Maximum new tokens (excluding input)
# Soft constraints (with beam search)
length_penalty=1.0, # Encourage longer/shorter outputs
# Early stopping
early_stopping=True, # Stop when condition met
)
```
### Bad Words and Forced Tokens
```python
# Prevent specific tokens
bad_words_ids = [
tokenizer.encode("badword1", add_special_tokens=False),
tokenizer.encode("badword2", add_special_tokens=False),
]
outputs = model.generate(
**inputs,
bad_words_ids=bad_words_ids,
)
# Force specific tokens
force_words_ids = [
tokenizer.encode("important", add_special_tokens=False),
]
outputs = model.generate(
**inputs,
force_words_ids=force_words_ids,
)
```
## Streaming Generation
Generate and process tokens as they're produced:
```python
from transformers import TextStreamer, TextIteratorStreamer
from threading import Thread
# Simple streaming (prints to stdout)
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(**inputs, streamer=streamer, max_new_tokens=100)
# Iterator streaming (for custom processing)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=100)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
print(text, end="", flush=True)
thread.join()
```
## Advanced Techniques
### Contrastive Search
Balance coherence and diversity using contrastive objective:
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
penalty_alpha=0.6, # Contrastive penalty
top_k=4, # Consider top-4 tokens
)
```
**When to use:**
- Open-ended text generation
- Reduces repetition without sacrificing coherence
- Good alternative to sampling
### Diverse Beam Search
Generate multiple diverse outputs:
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=10,
num_beam_groups=5, # 5 groups of 2 beams each
diversity_penalty=1.0, # Penalty for similar beams
num_return_sequences=5, # Return 5 diverse outputs
)
```
### Constrained Beam Search
Force output to include specific phrases:
Force generation to include specific tokens or follow patterns.
```python
from transformers import PhrasalConstraint
constraints = [
PhrasalConstraint(
tokenizer("machine learning", add_special_tokens=False).input_ids
),
PhrasalConstraint(tokenizer("New York", add_special_tokens=False).input_ids)
]
outputs = model.generate(
**inputs,
constraints=constraints,
num_beams=10, # Requires beam search
num_beams=5,
)
```
## Speculative Decoding
### Streaming Generation
Accelerate generation using a smaller draft model:
Generate tokens one at a time for real-time display.
```python
from transformers import AutoModelForCausalLM
from transformers import TextIteratorStreamer
from threading import Thread
# Load main and assistant models
model = AutoModelForCausalLM.from_pretrained("large-model")
assistant_model = AutoModelForCausalLM.from_pretrained("small-model")
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
generation_kwargs = dict(
**inputs,
max_new_tokens=100,
streamer=streamer,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for new_text in streamer:
print(new_text, end="", flush=True)
thread.join()
```
### Logit Processors
Customize token selection with custom logit processors.
```python
from transformers import LogitsProcessor, LogitsProcessorList
class CustomLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids, scores):
        # Modify scores here
        return scores
logits_processor = LogitsProcessorList([CustomLogitsProcessor()])
# Generate with speculative decoding
outputs = model.generate(
**inputs,
assistant_model=assistant_model,
logits_processor=logits_processor,
)
```
### Stopping Criteria
Define custom stopping conditions.
```python
from transformers import StoppingCriteria, StoppingCriteriaList
class CustomStoppingCriteria(StoppingCriteria):
    def __call__(self, input_ids, scores, **kwargs):
        # Return True to stop generation
        return False
stopping_criteria = StoppingCriteriaList([CustomStoppingCriteria()])
outputs = model.generate(
**inputs,
stopping_criteria=stopping_criteria,
)
```
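As a concrete illustration (a hypothetical use case, reusing the `model`, `tokenizer`, and `inputs` from the examples above), a criterion that stops once a chosen token id is generated could look like:
```python
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnToken(StoppingCriteria):
    """Stop generation once every sequence has produced the given token id."""

    def __init__(self, stop_token_id):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids, scores, **kwargs):
        # input_ids has shape (batch_size, sequence_length); check the last token
        return bool((input_ids[:, -1] == self.stop_token_id).all())

stopping_criteria = StoppingCriteriaList([StopOnToken(tokenizer.eos_token_id)])
outputs = model.generate(**inputs, max_new_tokens=100, stopping_criteria=stopping_criteria)
```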
## Best Practices
### For Creative Tasks (Stories, Dialogue)
```python
outputs = model.generate(
**inputs,
max_new_tokens=200,
do_sample=True,
temperature=0.8,
)
```
**Benefits:**
- 2-3x faster generation
- Identical output distribution to regular generation
- Works with sampling and greedy decoding
## Recipe: Recommended Settings by Task
### Creative Writing / Dialogue
```python
outputs = model.generate(
**inputs,
do_sample=True,
max_new_tokens=200,
temperature=0.9,
top_p=0.95,
top_k=50,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
)
```
### Translation / Summarization
### For Factual Tasks (Summaries, QA)
```python
outputs = model.generate(
**inputs,
num_beams=5,
max_new_tokens=150,
max_new_tokens=100,
num_beams=4,
early_stopping=True,
length_penalty=1.0,
no_repeat_ngram_size=2,
length_penalty=1.0,
)
```
### Code Generation
### For Chat/Instruction Following
```python
outputs = model.generate(
**inputs,
max_new_tokens=300,
temperature=0.2, # Low temperature for correctness
top_p=0.95,
max_new_tokens=512,
do_sample=True,
)
```
### Chatbot / Instruction Following
```python
outputs = model.generate(
**inputs,
do_sample=True,
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.15,
repetition_penalty=1.1,
)
```
### Factual QA / Information Extraction
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=3,
    early_stopping=True,
    # Or greedy for very short answers:
    # (no special parameters needed)
)
```
## Vision-Language Model Generation
For models like LLaVA, BLIP-2, etc.:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open("image.jpg")
inputs = processor(text="Describe this image", images=image, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
generated_text = processor.decode(outputs[0], skip_special_tokens=True)
```
## Debugging Generation
### Check Token Probabilities
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    output_scores=True,            # Return generation scores
    return_dict_in_generate=True,  # Return as dict
)
# Access generation scores
scores = outputs.scores  # Tuple with one (batch_size, vocab_size) tensor per generated token
# Get token probabilities
import torch
probs = torch.softmax(scores[0], dim=-1)
```
### Monitor Generation Process
```python
from transformers import LogitsProcessor, LogitsProcessorList
class DebugLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids, scores):
        # Print top 5 tokens at each step
        top_tokens = scores[0].topk(5)
        print(f"Top 5 tokens: {top_tokens}")
        return scores
outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    logits_processor=LogitsProcessorList([DebugLogitsProcessor()]),
)
```
## Common Issues and Solutions
**Issue: Repetitive output**
- Solution: Increase `repetition_penalty` (1.2-1.5), set `no_repeat_ngram_size=3`
- For sampling: Increase `temperature`, enable `top_p` (see the combined example after this list)
**Issue: Incoherent output**
- Solution: Lower `temperature` (0.5-0.8), use beam search
- Set `top_k=50` or `top_p=0.9` to filter unlikely tokens
**Issue: Too short output**
- Solution: Increase `min_length`, set `length_penalty > 1.0` (beam search)
- Check if EOS token is being generated early
**Issue: Too slow generation**
- Solution: Use greedy instead of beam search
- Reduce `num_beams`
- Try speculative decoding with assistant model
- Use smaller model variant
**Issue: Output doesn't follow format**
- Solution: Use constrained beam search
- Add format examples to prompt
- Use `bad_words_ids` to prevent format-breaking tokens
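For the repetition issue specifically, the remedies above can be combined in a single call; the values below are illustrative starting points rather than tuned settings:
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.9,          # add randomness so the model can escape loops
    top_p=0.9,                # drop the long tail of unlikely tokens
    repetition_penalty=1.3,   # discourage tokens that already appeared
    no_repeat_ngram_size=3,   # hard-block repeated 3-grams
)
```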
## Performance Optimization
### Use KV Cache
```python
# Use half precision
model = AutoModelForCausalLM.from_pretrained(
"model-name",
torch_dtype=torch.float16,
device_map="auto"
)
# Use KV cache optimization (default, but can be disabled)
# KV cache is enabled by default
outputs = model.generate(**inputs, use_cache=True)
# Batch generation
inputs = tokenizer(["Prompt 1", "Prompt 2"], return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=50)
# Static cache for longer sequences (if supported)
outputs = model.generate(**inputs, cache_implementation="static")
```
This guide covers the main generation strategies. For task-specific examples, see `task_patterns.md`.
### Mixed Precision
```python
import torch
with torch.cuda.amp.autocast():
outputs = model.generate(**inputs, max_new_tokens=100)
```
### Batch Generation
```python
texts = ["Prompt 1", "Prompt 2", "Prompt 3"]
inputs = tokenizer(texts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=50)
```
### Quantization
```python
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
```

View File

@@ -0,0 +1,234 @@
# Transformers Pipelines
Pipelines provide a simple and optimized interface for inference across many machine learning tasks. They abstract away the complexity of tokenization, model invocation, and post-processing.
## Usage Pattern
```python
from transformers import pipeline
# Basic usage
classifier = pipeline("text-classification")
result = classifier("This movie was amazing!")
# With specific model
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("This movie was amazing!")
```
## Natural Language Processing Pipelines
### Text Classification
```python
classifier = pipeline("text-classification")
classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
```
### Zero-Shot Classification
```python
classifier = pipeline("zero-shot-classification")
classifier("This is about climate change", candidate_labels=["politics", "science", "sports"])
```
### Token Classification (NER)
```python
ner = pipeline("token-classification")
ner("My name is Sarah and I work at Microsoft in Seattle")
```
### Question Answering
```python
qa = pipeline("question-answering")
qa(question="What is the capital?", context="The capital of France is Paris.")
```
### Text Generation
```python
generator = pipeline("text-generation")
generator("Once upon a time", max_length=50)
```
### Text2Text Generation
```python
generator = pipeline("text2text-generation", model="t5-base")
generator("translate English to French: Hello")
```
### Summarization
```python
summarizer = pipeline("summarization")
summarizer("Long article text here...", max_length=130, min_length=30)
```
### Translation
```python
translator = pipeline("translation_en_to_fr")
translator("Hello, how are you?")
```
### Fill Mask
```python
unmasker = pipeline("fill-mask")
unmasker("Paris is the [MASK] of France.")
```
### Feature Extraction
```python
extractor = pipeline("feature-extraction")
embeddings = extractor("This is a sentence")
```
### Document Question Answering
```python
doc_qa = pipeline("document-question-answering")
doc_qa(image="document.png", question="What is the invoice number?")
```
### Table Question Answering
```python
table_qa = pipeline("table-question-answering")
table_qa(table=data, query="How many employees?")
```
## Computer Vision Pipelines
### Image Classification
```python
classifier = pipeline("image-classification")
classifier("cat.jpg")
```
### Zero-Shot Image Classification
```python
classifier = pipeline("zero-shot-image-classification")
classifier("cat.jpg", candidate_labels=["cat", "dog", "bird"])
```
### Object Detection
```python
detector = pipeline("object-detection")
detector("street.jpg")
```
### Image Segmentation
```python
segmenter = pipeline("image-segmentation")
segmenter("image.jpg")
```
### Image-to-Image
```python
img2img = pipeline("image-to-image", model="lllyasviel/sd-controlnet-canny")
img2img("input.jpg")
```
### Depth Estimation
```python
depth = pipeline("depth-estimation")
depth("image.jpg")
```
### Video Classification
```python
classifier = pipeline("video-classification")
classifier("video.mp4")
```
### Keypoint Matching
```python
matcher = pipeline("keypoint-matching")
matcher(image1="img1.jpg", image2="img2.jpg")
```
## Audio Pipelines
### Automatic Speech Recognition
```python
asr = pipeline("automatic-speech-recognition")
asr("audio.wav")
```
### Audio Classification
```python
classifier = pipeline("audio-classification")
classifier("audio.wav")
```
### Zero-Shot Audio Classification
```python
classifier = pipeline("zero-shot-audio-classification")
classifier("audio.wav", candidate_labels=["speech", "music", "noise"])
```
### Text-to-Audio/Text-to-Speech
```python
synthesizer = pipeline("text-to-audio")
audio = synthesizer("Hello, how are you today?")
```
## Multimodal Pipelines
### Image-to-Text (Image Captioning)
```python
captioner = pipeline("image-to-text")
captioner("image.jpg")
```
### Visual Question Answering
```python
vqa = pipeline("visual-question-answering")
vqa(image="image.jpg", question="What color is the car?")
```
### Image-Text-to-Text (VLMs)
```python
vlm = pipeline("image-text-to-text")
vlm(images="image.jpg", text="Describe this image in detail")
```
### Zero-Shot Object Detection
```python
detector = pipeline("zero-shot-object-detection")
detector("image.jpg", candidate_labels=["car", "person", "tree"])
```
## Pipeline Configuration
### Common Parameters
- `model`: Specify model identifier or path
- `device`: Set device (0 for GPU, -1 for CPU, or "cuda:0")
- `batch_size`: Process multiple inputs at once
- `torch_dtype`: Set precision (torch.float16, torch.bfloat16)
```python
# GPU with half precision
pipe = pipeline("text-generation", model="gpt2", device=0, torch_dtype=torch.float16)
# Batch processing
pipe(["text 1", "text 2", "text 3"], batch_size=8)
```
### Task-Specific Parameters
Each pipeline accepts task-specific parameters in the call:
```python
# Text generation
generator("prompt", max_length=100, temperature=0.7, top_p=0.9, num_return_sequences=3)
# Summarization
summarizer("text", max_length=130, min_length=30, do_sample=False)
# Translation
translator("text", max_length=512, num_beams=4)
```
## Best Practices
1. **Reuse pipelines**: Create once, use multiple times for efficiency
2. **Batch processing**: Use batches for multiple inputs to maximize throughput
3. **GPU acceleration**: Set `device=0` for GPU when available
4. **Model selection**: Choose task-specific models for best results
5. **Memory management**: Use `torch_dtype=torch.float16` for large models
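A short sketch that puts these practices together (the model name and inputs are only examples):
```python
import torch
from transformers import pipeline

# Create the pipeline once and reuse it; GPU + half precision for larger models
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,                    # first GPU; use -1 to fall back to CPU
    torch_dtype=torch.float16,
)

# Batch several inputs through the same pipeline to maximize throughput
texts = ["Great product!", "Terrible support.", "Average experience."]
results = classifier(texts, batch_size=8)
```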

View File

@@ -1,504 +0,0 @@
# Model Quantization Guide
Comprehensive guide to reducing model memory footprint through quantization while maintaining accuracy.
## Overview
Quantization reduces memory requirements by storing model weights in lower precision formats (int8, int4) instead of full precision (float32). This enables:
- Running larger models on limited hardware
- Faster inference (reduced memory bandwidth)
- Lower deployment costs
- Fine-tuning models that wouldn't otherwise fit in memory
**Tradeoffs:**
- Slight accuracy loss (typically < 1-2%)
- Initial quantization overhead
- Some methods require calibration data
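The memory savings follow from simple arithmetic on bytes per parameter; a back-of-the-envelope sketch for a 7B-parameter model (weights only, ignoring activations and KV cache):
```python
params = 7e9  # 7B parameters

bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    print(f"{precision:>9}: ~{params * nbytes / 1e9:.1f} GB of weights")
# fp32 ~28 GB, fp16 ~14 GB, int8 ~7 GB, int4 ~3.5 GB
```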
## Quick Comparison
| Method | Precision | Speed | Accuracy | Fine-tuning | Hardware | Setup |
|--------|-----------|-------|----------|-------------|----------|-------|
| **Bitsandbytes** | 4/8-bit | Fast | High | Yes (PEFT) | CUDA, CPU | Easy |
| **GPTQ** | 2-8-bit | Very Fast | High | Limited | CUDA, ROCm, Metal | Medium |
| **AWQ** | 4-bit | Very Fast | High | Yes (PEFT) | CUDA, ROCm | Medium |
| **GGUF** | 1-8-bit | Medium | Variable | No | CPU-optimized | Easy |
| **HQQ** | 1-8-bit | Fast | High | Yes | Multi-platform | Medium |
## Bitsandbytes (BnB)
On-the-fly quantization with excellent PEFT fine-tuning support.
### 8-bit Quantization
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
load_in_8bit=True, # Enable 8-bit quantization
device_map="auto", # Automatic device placement
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Use normally
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
```
**Memory Savings:**
- 7B model: ~14GB → ~7GB (50% reduction)
- 13B model: ~26GB → ~13GB
- 70B model: ~140GB → ~70GB
**Characteristics:**
- Fast inference
- Minimal accuracy loss
- Works with PEFT (LoRA, QLoRA)
- Supports CPU and CUDA GPUs
### 4-bit Quantization (QLoRA)
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Enable 4-bit quantization
bnb_4bit_quant_type="nf4", # Quantization type ("nf4" or "fp4")
bnb_4bit_compute_dtype=torch.float16, # Computation dtype
bnb_4bit_use_double_quant=True, # Nested quantization for more savings
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)
```
**Memory Savings:**
- 7B model: ~14GB → ~4GB (70% reduction)
- 13B model: ~26GB → ~7GB
- 70B model: ~140GB → ~35GB
**Quantization Types:**
- `nf4`: Normal Float 4 (recommended, better quality)
- `fp4`: Floating Point 4 (slightly more memory efficient)
**Compute Dtype:**
```python
# For better quality
bnb_4bit_compute_dtype=torch.float16
# For best performance on Ampere+ GPUs
bnb_4bit_compute_dtype=torch.bfloat16
```
**Double Quantization:**
```python
# Enable for additional ~0.4 bits/param savings
bnb_4bit_use_double_quant=True # Quantize the quantization constants
```
### Fine-tuning with QLoRA
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# Load quantized model
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)
# Prepare for training
model = prepare_model_for_kbit_training(model)
# Configure LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Train normally
trainer = Trainer(model=model, args=training_args, ...)
trainer.train()
```
## GPTQ
Post-training quantization requiring calibration, optimized for inference speed.
### Loading GPTQ Models
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
# Load pre-quantized GPTQ model
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GPTQ", # Pre-quantized model
device_map="auto",
revision="gptq-4bit-32g-actorder_True", # Specific quantization config
)
# Or quantize yourself
gptq_config = GPTQConfig(
bits=4, # 2, 3, 4, 8 bits
dataset="c4", # Calibration dataset
tokenizer=tokenizer,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
device_map="auto",
quantization_config=gptq_config,
)
# Save quantized model
model.save_pretrained("llama-2-7b-gptq")
```
**Configuration Options:**
```python
gptq_config = GPTQConfig(
bits=4, # Quantization bits
group_size=128, # Group size for quantization (128, 32, -1)
dataset="c4", # Calibration dataset
desc_act=False, # Activation order (can improve accuracy)
sym=True, # Symmetric quantization
damp_percent=0.1, # Dampening factor
)
```
**Characteristics:**
- Fastest inference among quantization methods
- Requires one-time calibration (slow)
- Best when using pre-quantized models from Hub
- Limited fine-tuning support
- Excellent for production deployment
## AWQ (Activation-aware Weight Quantization)
Protects important weights for better quality.
### Loading AWQ Models
```python
from transformers import AutoModelForCausalLM, AwqConfig
# Load pre-quantized AWQ model
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-AWQ",
device_map="auto",
)
# Or quantize yourself
awq_config = AwqConfig(
bits=4, # 4-bit quantization
group_size=128, # Quantization group size
zero_point=True, # Use zero-point quantization
version="GEMM", # Quantization version
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=awq_config,
device_map="auto",
)
```
**Characteristics:**
- Better accuracy than GPTQ at same bit width
- Excellent inference speed
- Supports PEFT fine-tuning
- Requires calibration data
### Fine-tuning AWQ Models
```python
from peft import LoraConfig, get_peft_model
# AWQ models support LoRA fine-tuning
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
trainer = Trainer(model=model, ...)
trainer.train()
```
## GGUF (GGML Format)
CPU-optimized quantization format, popular in llama.cpp ecosystem.
### Using GGUF Models
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load GGUF model
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GGUF",
gguf_file="llama-2-7b.Q4_K_M.gguf", # Specific quantization file
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GGUF")
```
**GGUF Quantization Types:**
- `Q4_0`: 4-bit, smallest, lowest quality
- `Q4_K_M`: 4-bit, medium quality (recommended)
- `Q5_K_M`: 5-bit, good quality
- `Q6_K`: 6-bit, high quality
- `Q8_0`: 8-bit, very high quality
**Characteristics:**
- Optimized for CPU inference
- Wide range of bit depths (1-8)
- Good for Apple Silicon (M1/M2)
- No fine-tuning support
- Excellent for local/edge deployment
## HQQ (Half-Quadratic Quantization)
Flexible quantization with good accuracy retention.
### Using HQQ
```python
from transformers import AutoModelForCausalLM, HqqConfig
hqq_config = HqqConfig(
nbits=4, # Quantization bits
group_size=64, # Group size
quant_zero=False, # Quantize zero point
quant_scale=False, # Quantize scale
axis=0, # Quantization axis
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=hqq_config,
device_map="auto",
)
```
**Characteristics:**
- Very fast quantization
- No calibration data needed
- Support for 1-8 bits
- Can serialize/deserialize
- Good accuracy vs size tradeoff
## Choosing a Quantization Method
### Decision Tree
**For inference only:**
1. Need fastest inference? → **GPTQ or AWQ** (use pre-quantized models)
2. CPU-only deployment? → **GGUF**
3. Want easiest setup? → **Bitsandbytes 8-bit**
4. Need extreme compression? → **GGUF Q4_0 or HQQ 2-bit**
**For fine-tuning:**
1. Limited VRAM? → **QLoRA (BnB 4-bit + LoRA)**
2. Want best accuracy? → **Bitsandbytes 8-bit + LoRA**
3. Need very large models? → **QLoRA with double quantization**
**For production:**
1. Latency-critical? → **GPTQ or AWQ**
2. Cost-optimized? → **Bitsandbytes 8-bit**
3. CPU deployment? → **GGUF**
## Memory Requirements
Approximate memory for Llama-2 7B model:
| Method | Memory | vs FP16 |
|--------|--------|---------|
| FP32 | 28GB | 2x |
| FP16 / BF16 | 14GB | 1x |
| 8-bit (BnB) | 7GB | 0.5x |
| 4-bit (QLoRA) | 3.5GB | 0.25x |
| 4-bit Double Quant | 3GB | 0.21x |
| GPTQ 4-bit | 4GB | 0.29x |
| AWQ 4-bit | 4GB | 0.29x |
**Note:** Add ~1-2GB for inference activations, KV cache, and framework overhead.
## Best Practices
### For Training
```python
# QLoRA recommended configuration
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16, # BF16 if available
bnb_4bit_use_double_quant=True,
)
# LoRA configuration
lora_config = LoraConfig(
r=16, # Rank (8, 16, 32, 64)
lora_alpha=32, # Scaling (typically 2*r)
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
```
### For Inference
```python
# High-speed inference
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GPTQ",
device_map="auto",
torch_dtype=torch.float16, # Use FP16 for activations
)
# Balanced quality/speed
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
load_in_8bit=True,
device_map="auto",
)
# Maximum compression
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
),
device_map="auto",
)
```
### Multi-GPU Setups
```python
# Automatically distribute across GPUs
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
load_in_4bit=True,
device_map="auto", # Automatic distribution
max_memory={0: "20GB", 1: "20GB"}, # Optional: limit per GPU
)
# Manual device map
device_map = {
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
# ... distribute layers ...
"model.norm": 1,
"lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
load_in_4bit=True,
device_map=device_map,
)
```
## Troubleshooting
**Issue: OOM during quantization**
```python
# Solution: Use low_cpu_mem_usage
model = AutoModelForCausalLM.from_pretrained(
"model-name",
quantization_config=config,
device_map="auto",
low_cpu_mem_usage=True, # Reduce CPU memory during loading
)
```
**Issue: Slow quantization**
```python
# GPTQ/AWQ take time to calibrate
# Solution: Use pre-quantized models from Hub
model = AutoModelForCausalLM.from_pretrained("TheBloke/Model-GPTQ")
# Or use BnB for instant quantization
model = AutoModelForCausalLM.from_pretrained("model-name", load_in_4bit=True)
```
**Issue: Poor quality after quantization**
```python
# Try different quantization types
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # Try "nf4" instead of "fp4"
bnb_4bit_compute_dtype=torch.bfloat16, # Use BF16 if available
)
# Or use 8-bit instead of 4-bit
model = AutoModelForCausalLM.from_pretrained("model-name", load_in_8bit=True)
```
**Issue: Can't fine-tune quantized model**
```python
# Ensure using compatible quantization method
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
# Only BnB and AWQ support PEFT fine-tuning
# GPTQ has limited support, GGUF doesn't support fine-tuning
```
## Performance Benchmarks
Approximate generation speed (tokens/sec) for Llama-2 7B on A100 40GB:
| Method | Speed | Memory |
|--------|-------|--------|
| FP16 | 100 tok/s | 14GB |
| 8-bit | 90 tok/s | 7GB |
| 4-bit QLoRA | 70 tok/s | 4GB |
| GPTQ 4-bit | 95 tok/s | 4GB |
| AWQ 4-bit | 95 tok/s | 4GB |
**Note:** Actual performance varies by hardware, sequence length, and batch size.
## Resources
- **Pre-quantized models:** Search "GPTQ" or "AWQ" on Hugging Face Hub
- **BnB documentation:** https://github.com/TimDettmers/bitsandbytes
- **PEFT library:** https://github.com/huggingface/peft
- **QLoRA paper:** https://arxiv.org/abs/2305.14314
For task-specific quantization examples, see `training_guide.md`.

File diff suppressed because it is too large

View File

@@ -0,0 +1,328 @@
# Training with Transformers
Transformers provides comprehensive training capabilities through the `Trainer` API, supporting distributed training, mixed precision, and advanced optimization techniques.
## Basic Training Workflow
```python
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments
)
from datasets import load_dataset
# 1. Load and preprocess data
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2
)
# 3. Define training arguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
learning_rate=2e-5,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
# 4. Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
)
# 5. Train
trainer.train()
# 6. Evaluate
trainer.evaluate()
# 7. Save model
trainer.save_model("./final_model")
```
## TrainingArguments Configuration
### Essential Parameters
**Output and Logging:**
- `output_dir`: Directory for checkpoints and outputs (required)
- `logging_dir`: TensorBoard log directory (default: `{output_dir}/runs`)
- `logging_steps`: Log every N steps (default: 500)
- `logging_strategy`: "steps" or "epoch"
**Training Duration:**
- `num_train_epochs`: Number of epochs (default: 3.0)
- `max_steps`: Max training steps (overrides num_train_epochs if set)
**Batch Size and Gradient Accumulation:**
- `per_device_train_batch_size`: Batch size per device (default: 8)
- `per_device_eval_batch_size`: Eval batch size per device (default: 8)
- `gradient_accumulation_steps`: Accumulate gradients over N steps (default: 1)
- Effective batch size = `per_device_train_batch_size * gradient_accumulation_steps * num_gpus`
**Learning Rate:**
- `learning_rate`: Peak learning rate (default: 5e-5)
- `lr_scheduler_type`: Scheduler type ("linear", "cosine", "constant", etc.)
- `warmup_steps`: Warmup steps (default: 0)
- `warmup_ratio`: Warmup as fraction of total steps
**Evaluation:**
- `eval_strategy`: "no", "steps", or "epoch" (default: "no")
- `eval_steps`: Evaluate every N steps (if eval_strategy="steps")
- `eval_delay`: Delay evaluation until N steps
**Checkpointing:**
- `save_strategy`: "no", "steps", or "epoch" (default: "steps")
- `save_steps`: Save checkpoint every N steps (default: 500)
- `save_total_limit`: Keep only N most recent checkpoints
- `load_best_model_at_end`: Load best checkpoint at end (default: False)
- `metric_for_best_model`: Metric to determine best model
**Optimization:**
- `optim`: Optimizer ("adamw_torch", "adamw_hf", "sgd", etc.)
- `weight_decay`: Weight decay coefficient (default: 0.0)
- `adam_beta1`, `adam_beta2`: Adam optimizer betas
- `adam_epsilon`: Epsilon for Adam (default: 1e-8)
- `max_grad_norm`: Max gradient norm for clipping (default: 1.0)
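Pulling the common knobs above into one configuration (the values are illustrative, not recommendations for any specific task):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size = 16 * 2 * num_gpus
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    weight_decay=0.01,
    max_grad_norm=1.0,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_steps=100,
)
```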
### Mixed Precision Training
```python
training_args = TrainingArguments(
output_dir="./results",
fp16=True, # Use fp16 on NVIDIA GPUs
fp16_opt_level="O1", # O0, O1, O2, O3 (Apex levels)
# or
bf16=True, # Use bf16 on Ampere+ GPUs (better than fp16)
)
```
### Distributed Training
**DataParallel (single-node multi-GPU):**
```python
# Automatic with multiple GPUs
training_args = TrainingArguments(
output_dir="./results",
per_device_train_batch_size=16, # Per GPU
)
# Run: python script.py
```
**DistributedDataParallel (multi-node or multi-GPU):**
```bash
# Single node, multiple GPUs
python -m torch.distributed.launch --nproc_per_node=4 script.py
# Or use accelerate
accelerate config
accelerate launch script.py
```
**DeepSpeed Integration:**
```python
training_args = TrainingArguments(
output_dir="./results",
deepspeed="ds_config.json", # DeepSpeed config file
)
```
### Advanced Features
**Gradient Checkpointing (reduce memory):**
```python
training_args = TrainingArguments(
output_dir="./results",
gradient_checkpointing=True,
)
```
**Compilation with torch.compile:**
```python
training_args = TrainingArguments(
output_dir="./results",
torch_compile=True,
torch_compile_backend="inductor", # or "cudagraphs"
)
```
**Push to Hub:**
```python
training_args = TrainingArguments(
output_dir="./results",
push_to_hub=True,
hub_model_id="username/model-name",
hub_strategy="every_save", # or "end"
)
```
## Custom Training Components
### Custom Metrics
```python
import evaluate
import numpy as np
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
trainer = Trainer(
model=model,
args=training_args,
compute_metrics=compute_metrics,
)
```
### Custom Loss Function
```python
import torch
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Custom loss calculation (class_weights is assumed to be defined elsewhere,
        # e.g. computed from label frequencies)
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```
### Data Collator
```python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=data_collator,
)
```
### Callbacks
```python
from transformers import TrainerCallback
class CustomCallback(TrainerCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        print(f"Epoch {state.epoch} completed!")
        return control
trainer = Trainer(
model=model,
args=training_args,
callbacks=[CustomCallback],
)
```
## Hyperparameter Search
```python
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=2
    )
trainer = Trainer(
model_init=model_init,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
)
# Optuna-based search
best_trial = trainer.hyperparameter_search(
direction="maximize",
backend="optuna",
n_trials=10,
hp_space=lambda trial: {
"learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
"per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
"num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
}
)
```
## Training Best Practices
1. **Start with small learning rates**: 2e-5 to 5e-5 for fine-tuning
2. **Use warmup**: 5-10% of total steps for learning rate warmup
3. **Monitor training**: Use eval_strategy="epoch" or "steps" to track progress
4. **Save checkpoints**: Set save_strategy and save_total_limit
5. **Use mixed precision**: Enable fp16 or bf16 for faster training
6. **Gradient accumulation**: For large effective batch sizes on limited memory
7. **Load best model**: Set load_best_model_at_end=True to avoid overfitting
8. **Push to Hub**: Enable push_to_hub for easy model sharing and versioning
## Common Training Patterns
### Classification
```python
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=num_classes,
id2label=id2label,
label2id=label2id
)
```
### Question Answering
```python
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
```
### Token Classification (NER)
```python
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased",
num_labels=num_tags,
id2label=id2label,
label2id=label2id
)
```
### Sequence-to-Sequence
```python
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
```
### Causal Language Modeling
```python
model = AutoModelForCausalLM.from_pretrained("gpt2")
```
### Masked Language Modeling
```python
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
```

View File

@@ -1,19 +1,12 @@
#!/usr/bin/env python3
"""
Complete example for fine-tuning a text classification model.
Fine-tune a transformer model for text classification.
This script demonstrates the full workflow:
1. Load dataset
2. Preprocess with tokenizer
3. Configure model
4. Train with Trainer
5. Evaluate and save
Usage:
python fine_tune_classifier.py --model bert-base-uncased --dataset imdb --epochs 3
This script demonstrates the complete workflow for fine-tuning a pre-trained
model on a classification task using the Trainer API.
"""
import argparse
import numpy as np
from datasets import load_dataset
from transformers import (
AutoTokenizer,
@@ -23,189 +16,225 @@ from transformers import (
DataCollatorWithPadding,
)
import evaluate
import numpy as np
def compute_metrics(eval_pred):
    """Compute accuracy and F1 score."""
    metric_accuracy = evaluate.load("accuracy")
    metric_f1 = evaluate.load("f1")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = metric_accuracy.compute(predictions=predictions, references=labels)
    f1 = metric_f1.compute(predictions=predictions, references=labels)
    return {"accuracy": accuracy["accuracy"], "f1": f1["f1"]}
def load_and_prepare_data(dataset_name="imdb", model_name="distilbert-base-uncased", max_samples=None):
    """
    Load dataset and tokenize.
    Args:
        dataset_name: Name of the dataset to load
        model_name: Name of the model/tokenizer to use
        max_samples: Limit number of samples (for quick testing)
    Returns:
        tokenized_datasets, tokenizer
    """
    print(f"Loading dataset: {dataset_name}")
    dataset = load_dataset(dataset_name)
    # Optionally limit samples for quick testing
    if max_samples:
        dataset["train"] = dataset["train"].select(range(max_samples))
        dataset["test"] = dataset["test"].select(range(min(max_samples, len(dataset["test"]))))
    print(f"Loading tokenizer: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=512
        )
    print("Tokenizing dataset...")
    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    return tokenized_datasets, tokenizer
def main():
parser = argparse.ArgumentParser(description="Fine-tune a text classification model")
parser.add_argument(
"--model",
type=str,
default="bert-base-uncased",
help="Pretrained model name or path",
)
parser.add_argument(
"--dataset",
type=str,
default="imdb",
help="Dataset name from Hugging Face Hub",
)
parser.add_argument(
"--max-samples",
type=int,
default=None,
help="Maximum samples to use (for quick testing)",
)
parser.add_argument(
"--output-dir",
type=str,
default="./results",
help="Output directory for checkpoints",
)
parser.add_argument(
"--epochs",
type=int,
default=3,
help="Number of training epochs",
)
parser.add_argument(
"--batch-size",
type=int,
default=16,
help="Batch size per device",
)
parser.add_argument(
"--learning-rate",
type=float,
default=2e-5,
help="Learning rate",
)
parser.add_argument(
"--push-to-hub",
action="store_true",
help="Push model to Hugging Face Hub after training",
)
def create_model(model_name, num_labels, id2label, label2id):
"""
Create classification model.
args = parser.parse_args()
print("=" * 60)
print("Text Classification Fine-Tuning")
print("=" * 60)
print(f"Model: {args.model}")
print(f"Dataset: {args.dataset}")
print(f"Epochs: {args.epochs}")
print(f"Batch size: {args.batch_size}")
print(f"Learning rate: {args.learning_rate}")
print("=" * 60)
# 1. Load dataset
print("\n[1/5] Loading dataset...")
dataset = load_dataset(args.dataset)
if args.max_samples:
dataset["train"] = dataset["train"].select(range(args.max_samples))
dataset["test"] = dataset["test"].select(range(args.max_samples // 5))
print(f"Train samples: {len(dataset['train'])}")
print(f"Test samples: {len(dataset['test'])}")
# 2. Preprocess
print("\n[2/5] Preprocessing data...")
tokenizer = AutoTokenizer.from_pretrained(args.model)
def preprocess_function(examples):
return tokenizer(examples["text"], truncation=True, max_length=512)
tokenized_dataset = dataset.map(preprocess_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# 3. Load model
print("\n[3/5] Loading model...")
# Determine number of labels
num_labels = len(set(dataset["train"]["label"]))
Args:
model_name: Name of the pre-trained model
num_labels: Number of classification labels
id2label: Dictionary mapping label IDs to names
label2id: Dictionary mapping label names to IDs
Returns:
model
"""
print(f"Loading model: {model_name}")
model = AutoModelForSequenceClassification.from_pretrained(
args.model,
model_name,
num_labels=num_labels,
id2label=id2label,
label2id=label2id
)
return model
print(f"Number of labels: {num_labels}")
print(f"Model parameters: {model.num_parameters():,}")
# 4. Configure training
print("\n[4/5] Configuring training...")
def define_compute_metrics(metric_name="accuracy"):
"""
Define function to compute metrics during evaluation.
Args:
metric_name: Name of the metric to use
Returns:
compute_metrics function
"""
metric = evaluate.load(metric_name)
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
return compute_metrics
def train_model(model, tokenizer, train_dataset, eval_dataset, output_dir="./results"):
"""
Train the model.
Args:
model: The model to train
tokenizer: The tokenizer
train_dataset: Training dataset
eval_dataset: Evaluation dataset
output_dir: Directory for checkpoints and logs
Returns:
trained model, trainer
"""
# Define training arguments
training_args = TrainingArguments(
output_dir=args.output_dir,
learning_rate=args.learning_rate,
per_device_train_batch_size=args.batch_size,
per_device_eval_batch_size=args.batch_size,
num_train_epochs=args.epochs,
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
learning_rate=2e-5,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=args.push_to_hub,
metric_for_best_model="accuracy",
logging_dir=f"{output_dir}/logs",
logging_steps=100,
save_total_limit=2,
fp16=False, # Set to True if using GPU with fp16 support
)
# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["test"],
tokenizer=tokenizer,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=data_collator,
compute_metrics=compute_metrics,
compute_metrics=define_compute_metrics("accuracy"),
)
# 5. Train
print("\n[5/5] Training...")
print("-" * 60)
# Train
print("\nStarting training...")
trainer.train()
# Evaluate
print("\n" + "=" * 60)
print("Final Evaluation")
print("=" * 60)
metrics = trainer.evaluate()
print("\nEvaluating model...")
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")
print(f"Accuracy: {metrics['eval_accuracy']:.4f}")
print(f"F1 Score: {metrics['eval_f1']:.4f}")
print(f"Loss: {metrics['eval_loss']:.4f}")
return model, trainer
# Save
print("\n" + "=" * 60)
print(f"Saving model to {args.output_dir}")
trainer.save_model(args.output_dir)
tokenizer.save_pretrained(args.output_dir)
if args.push_to_hub:
print("Pushing to Hugging Face Hub...")
trainer.push_to_hub()
def test_inference(model, tokenizer, id2label):
"""
Test the trained model with sample texts.
Args:
model: Trained model
tokenizer: Tokenizer
id2label: Dictionary mapping label IDs to names
"""
print("\n=== Testing Inference ===")
test_texts = [
"This movie was absolutely fantastic! I loved every minute of it.",
"Terrible film. Waste of time and money.",
"It was okay, nothing special but not bad either."
]
for text in test_texts:
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
outputs = model(**inputs)
predictions = outputs.logits.argmax(-1)
predicted_label = id2label[predictions.item()]
confidence = outputs.logits.softmax(-1).max().item()
print(f"\nText: {text}")
print(f"Prediction: {predicted_label} (confidence: {confidence:.3f})")
def main():
"""Main training pipeline."""
# Configuration
DATASET_NAME = "imdb"
MODEL_NAME = "distilbert-base-uncased"
OUTPUT_DIR = "./results"
MAX_SAMPLES = None # Set to a small number (e.g., 1000) for quick testing
# Label mapping
id2label = {0: "negative", 1: "positive"}
label2id = {"negative": 0, "positive": 1}
num_labels = len(id2label)
print("=" * 60)
print("Training complete!")
print("Fine-Tuning Text Classification Model")
print("=" * 60)
# Quick inference example
print("\nQuick inference example:")
from transformers import pipeline
classifier = pipeline(
"text-classification",
model=args.output_dir,
tokenizer=args.output_dir,
# Load and prepare data
tokenized_datasets, tokenizer = load_and_prepare_data(
dataset_name=DATASET_NAME,
model_name=MODEL_NAME,
max_samples=MAX_SAMPLES
)
example_text = "This is a great example of how to use transformers!"
result = classifier(example_text)
print(f"Text: {example_text}")
print(f"Prediction: {result[0]['label']} (score: {result[0]['score']:.4f})")
# Create model
model = create_model(
model_name=MODEL_NAME,
num_labels=num_labels,
id2label=id2label,
label2id=label2id
)
# Train model
model, trainer = train_model(
model=model,
tokenizer=tokenizer,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
output_dir=OUTPUT_DIR
)
# Save final model
print(f"\nSaving model to {OUTPUT_DIR}/final_model")
trainer.save_model(f"{OUTPUT_DIR}/final_model")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/final_model")
# Test inference
test_inference(model, tokenizer, id2label)
print("\n" + "=" * 60)
print("Training completed successfully!")
print("=" * 60)
if __name__ == "__main__":
    main()

scientific-packages/transformers/scripts/generate_text.py Executable file → Normal file
View File

@@ -1,231 +1,188 @@
#!/usr/bin/env python3
"""
Text generation with different decoding strategies.
This script demonstrates various text generation approaches using
different sampling and decoding strategies.
"""
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
def load_model_and_tokenizer(model_name="gpt2"):
    """
    Load model and tokenizer.
    Args:
        model_name: Name of the model to load
    Returns:
        model, tokenizer
    """
    print(f"Loading model: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    # Set pad token if not already set
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer
def generate_with_greedy(model, tokenizer, prompt, max_new_tokens=50):
    """Greedy decoding - always picks highest probability token."""
    print("\n=== Greedy Decoding ===")
    print(f"Prompt: {prompt}")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        num_beams=1,
        pad_token_id=tokenizer.pad_token_id
    )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Generated: {generated_text}\n")
def generate_with_beam_search(model, tokenizer, prompt, max_new_tokens=50, num_beams=5):
    """Beam search - explores multiple hypotheses."""
    print("\n=== Beam Search ===")
    print(f"Prompt: {prompt}")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
        early_stopping=True,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.pad_token_id
    )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Generated: {generated_text}\n")
def generate_with_sampling(model, tokenizer, prompt, max_new_tokens=50,
                           temperature=0.7, top_k=50, top_p=0.9):
    """Sampling with temperature, top-k, and nucleus (top-p) sampling."""
    print("\n=== Sampling (Temperature + Top-K + Top-P) ===")
    print(f"Prompt: {prompt}")
    print(f"Parameters: temperature={temperature}, top_k={top_k}, top_p={top_p}")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        pad_token_id=tokenizer.pad_token_id
    )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Generated: {generated_text}\n")
def generate_multiple_sequences(model, tokenizer, prompt, max_new_tokens=50,
                                num_return_sequences=3):
    """Generate multiple diverse sequences."""
    print("\n=== Multiple Sequences (with Sampling) ===")
    print(f"Prompt: {prompt}")
    print(f"Generating {num_return_sequences} sequences...")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        num_return_sequences=num_return_sequences,
        pad_token_id=tokenizer.pad_token_id
    )
    for i, output in enumerate(outputs):
        generated_text = tokenizer.decode(output, skip_special_tokens=True)
        print(f"\nSequence {i+1}: {generated_text}")
    print()
def generate_with_config(model, tokenizer, prompt):
"""Use GenerationConfig for reusable configuration."""
print("\n=== Using GenerationConfig ===")
print(f"Prompt: {prompt}")
# Create a generation config
generation_config = GenerationConfig(
max_new_tokens=50,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
pad_token_id=tokenizer.pad_token_id
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {generated_text}\n")
def compare_temperatures(model, tokenizer, prompt, max_new_tokens=50):
"""Compare different temperature settings."""
print("\n=== Temperature Comparison ===")
print(f"Prompt: {prompt}\n")
temperatures = [0.3, 0.7, 1.0, 1.5]
for temp in temperatures:
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=temp,
top_p=0.9,
pad_token_id=tokenizer.pad_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Temperature {temp}: {generated_text}\n")
def main():
    """Run all generation examples."""
    print("=" * 70)
    print("Text Generation Examples")
    print("=" * 70)
    # Load model and tokenizer
    print("\nLoading model...")
    model, tokenizer = load_model_and_tokenizer("gpt2")
    # Example prompts
    story_prompt = "Once upon a time in a distant galaxy"
    factual_prompt = "The three branches of the US government are"
    # Demonstrate different strategies
    generate_with_greedy(model, tokenizer, story_prompt)
    generate_with_beam_search(model, tokenizer, factual_prompt)
    generate_with_sampling(model, tokenizer, story_prompt)
    generate_multiple_sequences(model, tokenizer, story_prompt, num_return_sequences=3)
    generate_with_config(model, tokenizer, story_prompt)
    compare_temperatures(model, tokenizer, story_prompt)
    print("=" * 70)
    print("All generation examples completed!")
    print("=" * 70)
if __name__ == "__main__":
    main()

View File

@@ -1,105 +1,132 @@
#!/usr/bin/env python3
"""
Quick inference using Transformers pipelines.
This script demonstrates how to quickly use pre-trained models for inference
across various tasks using the pipeline API.
"""
from transformers import pipeline
def text_classification_example():
"""Sentiment analysis example."""
print("=== Text Classification ===")
classifier = pipeline("text-classification")
result = classifier("I love using Transformers! It makes NLP so easy.")
print(f"Result: {result}\n")
def named_entity_recognition_example():
"""Named Entity Recognition example."""
print("=== Named Entity Recognition ===")
ner = pipeline("token-classification", aggregation_strategy="simple")
text = "My name is Sarah and I work at Microsoft in Seattle"
entities = ner(text)
for entity in entities:
print(f"{entity['word']}: {entity['entity_group']} (score: {entity['score']:.3f})")
print()
def question_answering_example():
"""Question Answering example."""
print("=== Question Answering ===")
qa = pipeline("question-answering")
context = "Paris is the capital and most populous city of France. It is located in northern France."
question = "What is the capital of France?"
answer = qa(question=question, context=context)
print(f"Question: {question}")
print(f"Answer: {answer['answer']} (score: {answer['score']:.3f})\n")
def text_generation_example():
"""Text generation example."""
print("=== Text Generation ===")
generator = pipeline("text-generation", model="gpt2")
prompt = "Once upon a time in a land far away"
generated = generator(prompt, max_length=50, num_return_sequences=1)
print(f"Prompt: {prompt}")
print(f"Generated: {generated[0]['generated_text']}\n")
def summarization_example():
"""Text summarization example."""
print("=== Summarization ===")
summarizer = pipeline("summarization")
article = """
The Transformers library provides thousands of pretrained models to perform tasks
on texts such as classification, information extraction, question answering,
summarization, translation, text generation, etc in over 100 languages. Its aim
is to make cutting-edge NLP easier to use for everyone. The library provides APIs
to quickly download and use pretrained models on a given text, fine-tune them on
your own datasets then share them with the community on the model hub.
"""
summary = summarizer(article, max_length=50, min_length=25, do_sample=False)
print(f"Summary: {summary[0]['summary_text']}\n")
def translation_example():
"""Translation example."""
print("=== Translation ===")
translator = pipeline("translation_en_to_fr")
text = "Hello, how are you today?"
translation = translator(text)
print(f"English: {text}")
print(f"French: {translation[0]['translation_text']}\n")
def zero_shot_classification_example():
"""Zero-shot classification example."""
print("=== Zero-Shot Classification ===")
classifier = pipeline("zero-shot-classification")
text = "This is a breaking news story about a major earthquake."
candidate_labels = ["politics", "sports", "science", "breaking news"]
result = classifier(text, candidate_labels)
print(f"Text: {text}")
print("Predictions:")
for label, score in zip(result['labels'], result['scores']):
print(f" {label}: {score:.3f}")
print()
def image_classification_example():
"""Image classification example (requires PIL)."""
print("=== Image Classification ===")
try:
from PIL import Image
import requests
classifier = pipeline("image-classification")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
predictions = classifier(image)
print("Top predictions:")
for pred in predictions[:3]:
print(f" {pred['label']}: {pred['score']:.3f}")
print()
except ImportError:
print("PIL not installed. Skipping image classification example.\n")
def main():
    """Run all examples."""
    print("Transformers Quick Inference Examples")
    print("=" * 50 + "\n")
    # Text tasks
    text_classification_example()
    named_entity_recognition_example()
    question_answering_example()
    text_generation_example()
    summarization_example()
    translation_example()
    zero_shot_classification_example()
    # Vision task (optional)
    image_classification_example()
    print("=" * 50)
    print("All examples completed!")
if __name__ == "__main__":