
Model Quantization Guide

Comprehensive guide to reducing model memory footprint through quantization while maintaining accuracy.

Overview

Quantization reduces memory requirements by storing model weights in lower precision formats (int8, int4) instead of full precision (float32). This enables:

  • Running larger models on limited hardware
  • Faster inference (reduced memory bandwidth)
  • Lower deployment costs
  • Enabling fine-tuning of models that wouldn't fit in memory

Tradeoffs:

  • Slight accuracy loss (typically within 1-2%)
  • Initial quantization overhead
  • Some methods require calibration data
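A quick way to sanity-check these tradeoffs is to estimate weight memory as parameter count times bytes per parameter; the sketch below does that arithmetic (real footprints run slightly higher because of quantization constants, buffers, and runtime overhead).

# Rough weight-memory estimate: parameter count x bytes per parameter
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    return num_params * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: ~{estimate_weight_memory_gb(7e9, bits):.1f} GB")
# -> ~28 GB, ~14 GB, ~7 GB, ~3.5 GB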

Quick Comparison

| Method | Precision | Speed | Accuracy | Fine-tuning | Hardware | Setup |
| --- | --- | --- | --- | --- | --- | --- |
| Bitsandbytes | 4/8-bit | Fast | High | Yes (PEFT) | CUDA, CPU | Easy |
| GPTQ | 2-8-bit | Very fast | High | Limited | CUDA, ROCm, Metal | Medium |
| AWQ | 4-bit | Very fast | High | Yes (PEFT) | CUDA, ROCm | Medium |
| GGUF | 1-8-bit | Medium | Variable | No | CPU-optimized | Easy |
| HQQ | 1-8-bit | Fast | High | Yes | Multi-platform | Medium |

Bitsandbytes (BnB)

On-the-fly quantization with excellent PEFT fine-tuning support.

8-bit Quantization

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=True,              # Enable 8-bit quantization (recent versions prefer quantization_config=BitsAndBytesConfig(load_in_8bit=True))
    device_map="auto",              # Automatic device placement
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Use normally
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)

Memory Savings:

  • 7B model: ~14GB → ~7GB (50% reduction)
  • 13B model: ~26GB → ~13GB
  • 70B model: ~140GB → ~70GB
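To confirm the actual footprint of a loaded model, transformers exposes get_memory_footprint() on the model object:

# Weight memory of the loaded (quantized) model, in GB
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")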

Characteristics:

  • Fast inference
  • Minimal accuracy loss
  • Works with PEFT (LoRA, QLoRA)
  • Supports CPU and CUDA GPUs

4-bit Quantization (QLoRA)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # Quantization type ("nf4" or "fp4")
    bnb_4bit_compute_dtype=torch.float16,   # Computation dtype
    bnb_4bit_use_double_quant=True,         # Nested quantization for more savings
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

Memory Savings:

  • 7B model: ~14GB → ~4GB (70% reduction)
  • 13B model: ~26GB → ~7GB
  • 70B model: ~140GB → ~35GB

Quantization Types:

  • nf4: NormalFloat4 (recommended; designed for normally distributed weights, generally better quality)
  • fp4: standard 4-bit float (same memory footprint as nf4, usually slightly lower quality)

Compute Dtype:

# Widely supported default (works on pre-Ampere GPUs)
bnb_4bit_compute_dtype=torch.float16

# Better numerical stability and speed on Ampere (A100 / RTX 30xx) and newer GPUs
bnb_4bit_compute_dtype=torch.bfloat16
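
If the target GPU is not known in advance, the compute dtype can be chosen at runtime; a minimal sketch using PyTorch's capability check:

import torch
from transformers import BitsAndBytesConfig

# Prefer bfloat16 where the GPU supports it (Ampere and newer), otherwise fall back to float16
compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)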

Double Quantization:

# Enable for additional ~0.4 bits/param savings
bnb_4bit_use_double_quant=True  # Quantize the quantization constants

Fine-tuning with QLoRA

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# Load quantized model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Train as usual (training_args is a TrainingArguments instance; pass your dataset and data collator as well)
trainer = Trainer(model=model, args=training_args, ...)
trainer.train()
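After wrapping the model with LoRA, it is worth checking how small the trainable parameter set actually is; PEFT models provide print_trainable_parameters() for this:

# Typically well under 1% of parameters are trainable with LoRA
model.print_trainable_parameters()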

GPTQ

Post-training quantization requiring calibration, optimized for inference speed.

Loading GPTQ Models

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Load pre-quantized GPTQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",              # Pre-quantized model
    device_map="auto",
    revision="gptq-4bit-32g-actorder_True",  # Specific quantization config
)

# Or quantize yourself (the calibration step needs a tokenizer)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

gptq_config = GPTQConfig(
    bits=4,                              # 2, 3, 4, 8 bits
    dataset="c4",                        # Calibration dataset
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    quantization_config=gptq_config,
)

# Save quantized model
model.save_pretrained("llama-2-7b-gptq")
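The saved checkpoint stores its quantization config alongside the weights, so it can be reloaded later without re-running calibration:

# Reload the already-quantized checkpoint (no calibration step this time)
model = AutoModelForCausalLM.from_pretrained(
    "llama-2-7b-gptq",
    device_map="auto",
)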

Configuration Options:

gptq_config = GPTQConfig(
    bits=4,                              # Quantization bits
    group_size=128,                      # Quantization group size (e.g. 128, 32, or -1 for per-column)
    dataset="c4",                        # Calibration dataset
    desc_act=False,                      # Activation-order quantization; True can improve accuracy at some speed cost
    sym=True,                            # Symmetric quantization
    damp_percent=0.1,                    # Dampening factor for the calibration Hessian
)

Characteristics:

  • Fastest inference among quantization methods
  • Requires one-time calibration (slow)
  • Best when using pre-quantized models from Hub
  • Limited fine-tuning support
  • Excellent for production deployment

AWQ (Activation-aware Weight Quantization)

4-bit quantization that uses activation statistics to protect the most important weights, preserving quality.

Loading AWQ Models

from transformers import AutoModelForCausalLM, AwqConfig

# Load pre-quantized AWQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-AWQ",
    device_map="auto",
)

# Or quantize yourself
awq_config = AwqConfig(
    bits=4,                              # 4-bit quantization
    group_size=128,                      # Quantization group size
    zero_point=True,                     # Use zero-point quantization
    version="GEMM",                      # Quantization version
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=awq_config,
    device_map="auto",
)

Characteristics:

  • Often better accuracy than GPTQ at the same bit width
  • Excellent inference speed
  • Supports PEFT fine-tuning
  • Requires calibration data

Fine-tuning AWQ Models

from transformers import Trainer
from peft import LoraConfig, get_peft_model

# AWQ models support LoRA fine-tuning
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
trainer = Trainer(model=model, ...)
trainer.train()

GGUF (GGML Format)

CPU-friendly quantization format, popular in the llama.cpp ecosystem.

Using GGUF Models

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a GGUF checkpoint (transformers dequantizes the weights when loading)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGUF",
    gguf_file="llama-2-7b.Q4_K_M.gguf",  # Specific quantization file
    device_map="auto",
)

# The tokenizer can be loaded from the same GGUF file
tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/Llama-2-7B-GGUF",
    gguf_file="llama-2-7b.Q4_K_M.gguf",
)

GGUF Quantization Types:

  • Q4_0: 4-bit, smallest, lowest quality
  • Q4_K_M: 4-bit, medium quality (recommended)
  • Q5_K_M: 5-bit, good quality
  • Q6_K: 6-bit, high quality
  • Q8_0: 8-bit, very high quality
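GGUF repositories typically ship each quantization variant as a separate file; if you want the file locally first (for example to inspect it or share it with llama.cpp), huggingface_hub can download a single variant. A sketch, assuming the filename used above:

from huggingface_hub import hf_hub_download

# Download one GGUF variant from the repo
gguf_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
)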

Characteristics:

  • Optimized for CPU inference
  • Wide range of bit depths (1-8)
  • Good for Apple Silicon (M1/M2)
  • No fine-tuning support
  • Excellent for local/edge deployment

HQQ (Half-Quadratic Quantization)

Flexible quantization with good accuracy retention.

Using HQQ

from transformers import AutoModelForCausalLM, HqqConfig

hqq_config = HqqConfig(
    nbits=4,                             # Quantization bits
    group_size=64,                       # Group size
    quant_zero=False,                    # Quantize zero point
    quant_scale=False,                   # Quantize scale
    axis=0,                              # Quantization axis
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=hqq_config,
    device_map="auto",
)

Characteristics:

  • Very fast quantization
  • No calibration data needed
  • Support for 1-8 bits
  • Can serialize/deserialize
  • Good accuracy vs size tradeoff

Choosing a Quantization Method

Decision Tree

For inference only:

  1. Need fastest inference? → GPTQ or AWQ (use pre-quantized models)
  2. CPU-only deployment? → GGUF
  3. Want easiest setup? → Bitsandbytes 8-bit
  4. Need extreme compression? → GGUF Q4_0 or HQQ 2-bit

For fine-tuning:

  1. Limited VRAM? → QLoRA (BnB 4-bit + LoRA)
  2. Want best accuracy? → Bitsandbytes 8-bit + LoRA
  3. Need very large models? → QLoRA with double quantization

For production:

  1. Latency-critical? → GPTQ or AWQ
  2. Cost-optimized? → Bitsandbytes 8-bit
  3. CPU deployment? → GGUF
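The decision tree can be folded into a small helper for the bitsandbytes cases; the function below is an illustrative sketch (the name and the two constraints are hypothetical, not a library API):

import torch
from transformers import BitsAndBytesConfig

def pick_bnb_config(fine_tune: bool, low_vram: bool) -> BitsAndBytesConfig:
    """Hypothetical helper: map two simple constraints to a bitsandbytes config."""
    if fine_tune and low_vram:
        # QLoRA-style setup: 4-bit NF4 with double quantization
        return BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )
    # Otherwise 8-bit is the simplest accuracy-friendly option
    return BitsAndBytesConfig(load_in_8bit=True)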

Memory Requirements

Approximate memory for Llama-2 7B model:

| Method | Memory | vs FP16 |
| --- | --- | --- |
| FP32 | 28GB | 2x |
| FP16 / BF16 | 14GB | 1x |
| 8-bit (BnB) | 7GB | 0.5x |
| 4-bit (QLoRA) | 3.5GB | 0.25x |
| 4-bit + double quant | 3GB | 0.21x |
| GPTQ 4-bit | 4GB | 0.29x |
| AWQ 4-bit | 4GB | 0.29x |

Note: Add ~1-2GB for inference activations, KV cache, and framework overhead.
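The KV-cache part of that overhead can be estimated from the architecture; a sketch for Llama-2 7B (32 layers, 32 KV heads, head dimension 128), assuming an FP16 cache:

# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x seq_len x batch x bytes per element
def kv_cache_gb(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=1, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

print(f"~{kv_cache_gb():.1f} GB at 4096 tokens")  # roughly 2 GB for Llama-2 7B in FP16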

Best Practices

For Training

# QLoRA recommended configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 if available
    bnb_4bit_use_double_quant=True,
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                                    # Rank (8, 16, 32, 64)
    lora_alpha=32,                           # Scaling (typically 2*r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

For Inference

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# High-speed inference (pre-quantized GPTQ checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
    torch_dtype=torch.float16,           # Use FP16 for activations
)

# Balanced quality/speed
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)

# Maximum compression
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    ),
    device_map="auto",
)

Multi-GPU Setups

# Automatically distribute across GPUs
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    load_in_4bit=True,
    device_map="auto",                   # Automatic distribution
    max_memory={0: "20GB", 1: "20GB"},   # Optional: limit per GPU
)

# Manual device map
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    # ... distribute layers ...
    "model.norm": 1,
    "lm_head": 1,
}

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    load_in_4bit=True,
    device_map=device_map,
)
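After loading, the resolved placement can be inspected via the model's hf_device_map attribute, which helps confirm that no layers were silently offloaded to CPU:

# Show which device each module ended up on
print(model.hf_device_map)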

Troubleshooting

Issue: OOM during quantization

# Solution: Use low_cpu_mem_usage
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    low_cpu_mem_usage=True,              # Reduce CPU memory during loading
)

Issue: Slow quantization

# GPTQ/AWQ take time to calibrate
# Solution: Use pre-quantized models from Hub
model = AutoModelForCausalLM.from_pretrained("TheBloke/Model-GPTQ")

# Or use BnB for instant quantization
model = AutoModelForCausalLM.from_pretrained("model-name", load_in_4bit=True)

Issue: Poor quality after quantization

# Try different quantization types
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # Try "nf4" instead of "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # Use BF16 if available
)

# Or use 8-bit instead of 4-bit
model = AutoModelForCausalLM.from_pretrained("model-name", load_in_8bit=True)

Issue: Can't fine-tune quantized model

# Ensure using compatible quantization method
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

# Only BnB and AWQ support PEFT fine-tuning
# GPTQ has limited support, GGUF doesn't support fine-tuning

Performance Benchmarks

Approximate generation speed (tokens/sec) for Llama-2 7B on A100 40GB:

| Method | Speed | Memory |
| --- | --- | --- |
| FP16 | 100 tok/s | 14GB |
| 8-bit | 90 tok/s | 7GB |
| 4-bit QLoRA | 70 tok/s | 4GB |
| GPTQ 4-bit | 95 tok/s | 4GB |
| AWQ 4-bit | 95 tok/s | 4GB |

Note: Actual performance varies by hardware, sequence length, and batch size.
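To benchmark your own setup, time a generation call and divide the number of new tokens by the elapsed seconds; a minimal sketch (numbers will differ from the table above):

import time

inputs = tokenizer("Benchmark prompt", return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")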

Resources

For task-specific quantization examples, see training_guide.md.