Reduce model size and speed up inference through quantization.
## Quantization Overview
### Precision Levels

| Precision | Size | Speed | Quality |
|-----------|------|-------|---------|
| FP32 | 100% | Baseline | 100% |
| FP16 | 50% | 2x | ~100% |
| INT8 | 25% | 4x | 98-99% |
| INT4 | 12.5% | 8x | 95-98% |
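To make the size column concrete, the back-of-the-envelope sketch below estimates the weight memory of a 7B-parameter model at each precision. It is plain arithmetic, not tied to any library, and covers weights only; activations and the KV cache add to these numbers.

```python
# Approximate weight memory for a 7B-parameter model at each precision level.
# Weights only; activations and KV cache are not included.
PARAMS = 7_000_000_000
bytes_per_param = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    print(f"{precision}: ~{PARAMS * nbytes / 1024**3:.1f} GiB")
# FP32 ~26.1 GiB, FP16 ~13.0 GiB, INT8 ~6.5 GiB, INT4 ~3.3 GiB
```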
### Post-Training Quantization (PTQ)

```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Dynamic quantization (easiest): weights are quantized ahead of time,
# activations are quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Static quantization (better quality): requires a calibration pass
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)
# Run calibration data through the model here (see the sketch below)
torch.quantization.convert(model, inplace=True)
```
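The calibration step above is only a comment; here is a minimal sketch of what it could look like, assuming a hypothetical `calibration_loader` that yields tokenized batches. It would sit between the `prepare` and `convert` calls.

```python
import torch

# Run representative data through the prepared model so the observers
# can record activation ranges (used to pick the quantization scales).
# `calibration_loader` is a hypothetical DataLoader of tokenized batches.
model.eval()
with torch.no_grad():
    for i, batch in enumerate(calibration_loader):
        model(batch["input_ids"])
        if i >= 100:  # a few hundred representative samples is typically enough
            break
```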
### GPTQ Quantization

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Load a pre-quantized 4-bit model
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",
    device="cuda:0",
    use_safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")

# 4-bit inference with minimal quality loss
input_ids = tokenizer("Quantization is", return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
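Pre-quantized checkpoints like the one above are produced by running GPTQ's calibration-based quantization. A rough sketch of quantizing a model yourself, assuming the `auto_gptq` quantize API (`BaseQuantizeConfig`, `quantize`, `save_quantized`) and using far fewer calibration samples than you would in practice:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# 4-bit quantization with per-group scales (group_size=128 is a common choice)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantize_config)

# Calibration examples (in practice, use a few hundred representative texts)
examples = [tokenizer("Quantization reduces model size with little quality loss.", return_tensors="pt")]

model.quantize(examples)
model.save_quantized("llama-2-7b-gptq-4bit")
```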
### AWQ (Activation-Aware Weight Quantization)

Benefits over GPTQ:

- Faster inference
- Better quality preservation
- Activation-aware scaling that protects the weight channels most important to model output
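A minimal loading sketch, assuming the `autoawq` package is installed (recent `transformers` versions can then load AWQ checkpoints directly) and using the community checkpoint `TheBloke/Llama-2-7B-AWQ` as an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-quantized 4-bit AWQ checkpoint; requires the autoawq package
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-AWQ", device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")

inputs = tokenizer("Quantization matters because", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```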