Model Compression
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Model compression and quantization are techniques for reducing the size, memory footprint, and computational cost of neural networks while preserving as much of their predictive performance as possible. Large AI models like LLMs, vision transformers, and generative models require enormous memory and compute to run — making them impractical for edge devices, real-time applications, or cost-sensitive deployments. Compression enables AI capabilities to run on smartphones, embedded systems, and low-cost servers.
Remembering
- Model compression — Umbrella term for techniques that reduce a model's size and/or computational requirements.
- Quantization — Reducing the bit-width of model weights and/or activations (FP32 → INT8 → INT4 → INT2).
- Post-training quantization (PTQ) — Quantizing a trained model without further training; fast but may reduce accuracy.
- Quantization-aware training (QAT) — Simulating quantization during training, allowing the model to adapt; better accuracy than PTQ.
- Weight quantization — Quantizing model weights; does not require knowing activations at inference time.
- Activation quantization — Quantizing intermediate activations; requires calibration on representative data.
- Pruning — Removing unnecessary parameters (weights, neurons, attention heads, layers) from a trained network.
- Structured pruning — Removing entire neurons, filters, or attention heads; produces hardware-friendly sparse models.
- Unstructured pruning — Setting individual weights to zero without changing the tensor shape; high sparsity but limited hardware benefit without sparse compute support.
- Knowledge distillation — Training a smaller "student" model to mimic the behavior of a larger "teacher" model.
- LoRA (Low-Rank Adaptation) — Decomposes weight updates into low-rank matrices; used for efficient fine-tuning and as a form of compression.
- GPTQ — A post-training quantization method specifically for GPT-style LLMs; achieves INT4 with minimal quality loss.
- AWQ (Activation-aware Weight Quantization) — Quantization that preserves important weights identified by activation magnitude.
- bitsandbytes — A Python library providing efficient 8-bit and 4-bit (NF4/FP4) quantization for transformer models.
Understanding
Neural network weights are typically stored as 32-bit floating point numbers (FP32). Each weight takes 4 bytes. A 7-billion parameter model requires ~28GB in FP32 — too large for a single consumer GPU. Quantization to INT8 halves this to ~7GB; INT4 reduces to ~3.5GB. This makes the difference between "runs only on expensive server hardware" and "runs on a gaming laptop."
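As a sanity check on these figures, the arithmetic is just parameters times bytes per parameter. A minimal sketch (the printed values cover weights only, ignoring activations, KV cache, and runtime overhead):
<syntaxhighlight lang="python">
def model_size_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage: parameters x bits, converted to gigabytes."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B params at {bits:>2}-bit: {model_size_gb(7e9, bits):5.1f} GB")
# 28.0 GB (FP32), 14.0 GB (FP16), 7.0 GB (INT8), 3.5 GB (INT4)
</syntaxhighlight>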
The quality vs. size trade-off: Quantization is lossy — replacing precise floating point values with lower-precision integers introduces rounding errors. The key insight from research is that transformer models are surprisingly robust to aggressive quantization of weights (INT4–INT8), especially when carefully calibrated. Activations are more sensitive and often kept at higher precision.
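To make the rounding error concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization; this is a simplified illustration, not the exact scheme any particular library uses:
<syntaxhighlight lang="python">
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)    # a typical transformer weight matrix
q, scale = quantize_int8(w)
w_hat = q.float() * scale      # dequantize
rel_err = ((w_hat - w).abs().mean() / w.abs().mean()).item()
print(f"mean relative rounding error: {rel_err:.4f}")
# Prints a small value (around a percent for Gaussian-like weights). Each bit
# removed roughly doubles the quantization step, which is why INT4 needs
# careful calibration while INT8 is usually safe.
</syntaxhighlight>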
Knowledge distillation: A fundamentally different approach — train a small student model to reproduce the outputs of a large teacher, not just the hard labels. The teacher provides "soft labels" (probability distributions over classes) that contain more information than one-hot labels. Hinton's original insight: a model trained on "90% cat, 10% dog" learns richer representations than one trained on "100% cat."
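The temperature T makes those soft labels visible: dividing logits by T > 1 flattens the distribution so the student can see inter-class similarities. A quick sketch with illustrative logits:
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

logits = torch.tensor([5.0, 2.0, -1.0])  # teacher logits for (cat, dog, car)
for T in (1.0, 4.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
# T=1.0: ~[0.95, 0.05, 0.00]  -> nearly one-hot
# T=4.0: ~[0.59, 0.28, 0.13]  -> "cat is more like dog than car" becomes learnable
</syntaxhighlight>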
Layer sensitivity analysis: Not all layers are equally sensitive to quantization. Attention layers and first/last layers are typically more sensitive. Mixed-precision quantization keeps sensitive layers at FP16/FP32 and aggressively quantizes others at INT4.
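One way to find the sensitive layers is a leave-one-out sweep: quantize a single layer, measure quality, restore, repeat. A minimal sketch in which quantize_layer and eval_quality are hypothetical helpers standing in for whatever toolchain is in use:
<syntaxhighlight lang="python">
import copy

def sensitivity_scan(model, layer_names, quantize_layer, eval_quality, bits=4):
    """Rank layers by how much quality drops when only that layer is quantized."""
    baseline = eval_quality(model)
    drops = {}
    for name in layer_names:
        trial = copy.deepcopy(model)            # leave the original untouched
        quantize_layer(trial, name, bits=bits)  # quantize exactly one layer
        drops[name] = baseline - eval_quality(trial)
    # Largest drop first: candidates to keep at FP16/FP32 in a mixed-precision scheme.
    return dict(sorted(drops.items(), key=lambda kv: -kv[1]))
</syntaxhighlight>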
Applying
LLM quantization with bitsandbytes:
<syntaxhighlight lang="python">
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantization (NF4) with double quantization: ~4GB for a 7B model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16 despite 4-bit storage
    bnb_4bit_use_double_quant=True,         # Quantize the quantization constants too
    bnb_4bit_quant_type="nf4",              # NF4: optimized for normally distributed weights
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B-Instruct")

inputs = tokenizer("What is quantum computing?", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0]))
</syntaxhighlight>
Knowledge distillation:
<syntaxhighlight lang="python">
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, T=4.0, alpha=0.7):
    """Combined distillation + task loss."""
    # Soft target loss: KL(teacher_soft || student_soft) at temperature T
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (T ** 2)
    # Hard target loss: standard cross-entropy against ground-truth labels
    task = F.cross_entropy(student_logits, true_labels)
    return alpha * distill + (1 - alpha) * task
</syntaxhighlight>
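In the training loop, the teacher runs under torch.no_grad() so only the student receives gradients. A minimal sketch assuming teacher, student, and loader are already defined:
<syntaxhighlight lang="python">
import torch

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
student.train()
teacher.eval()
for inputs, labels in loader:
    with torch.no_grad():
        teacher_logits = teacher(inputs)  # frozen teacher, no gradient tracking
    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
</syntaxhighlight>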
Compression technique selection:
- LLM, limited VRAM → GPTQ (INT4), AWQ, GGUF (llama.cpp)
- Fine-tuning large model → LoRA + QLoRA (4-bit base + LoRA adapters)
- Edge deployment (vision) → INT8 PTQ with TensorRT or ONNX Runtime
- Creating a smaller model → Knowledge distillation (teacher→student)
- Redundant model structure → Structured pruning (remove attention heads, layers)
Analyzing
| Method | Quality Loss | Speed Gain | Size Reduction | Retraining Needed |
|---|---|---|---|---|
| FP32 → INT8 PTQ | Low (<1%) | 2-4× | 4× | No |
| FP32 → INT4 (GPTQ/AWQ) | Low-moderate | 4-6× | 8× | No |
| Knowledge distillation | Moderate | Model-dependent | 10-100× | Yes (student) |
| Structured pruning | Moderate | 2-4× | 2-10× | Yes (fine-tune) |
| LoRA fine-tuning | None (for fine-tuning) | None (inference unchanged) | None (adds ~0.5% params) | No (only adapters are trained) |
Failure modes:
- Catastrophic quantization: certain layers or operations are extremely sensitive; naive INT4 quantization collapses model quality.
- Calibration set mismatch: PTQ calibrated on the wrong data distribution produces poor results on the target domain.
- Student-teacher mismatch: if the student's capacity is too small to approximate the teacher, distillation adds noise, not signal.
- Speedup-accuracy assumptions: INT8 is only faster if the hardware has INT8 compute units (most server GPUs and NPUs do; some don't).
Evaluating
Evaluation must cover:
- Quality preservation: measure perplexity for LLMs, accuracy for classifiers — compare quantized vs. FP32 baseline.
- Latency and throughput: measure tokens/second or ms/inference on target hardware (a measurement sketch follows this list).
- Memory footprint: measure peak VRAM usage during inference.
- Task-specific degradation: some tasks (reasoning, math) are more sensitive to quantization than others (summarization, translation). Expert practitioners evaluate on at least 3 representative downstream tasks, not just perplexity.
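For the latency and memory rows above, a simple harness on the target device is usually enough. A minimal sketch assuming a CUDA device and a Hugging Face-style generate API (illustrative, not a rigorous benchmark):
<syntaxhighlight lang="python">
import time
import torch

def benchmark(model, tokenizer, prompt, n_tokens=128, device="cuda"):
    """Report decode throughput and peak VRAM for one greedy generation."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    generated = out.shape[1] - inputs["input_ids"].shape[1]
    peak_gb = torch.cuda.max_memory_allocated(device) / 1e9
    print(f"{generated / elapsed:.1f} tok/s, peak VRAM {peak_gb:.2f} GB")
</syntaxhighlight>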
Creating
Designing a model compression pipeline (a skeleton sketch follows the steps):
- Profile the base model: measure latency, memory, and quality on target hardware.
- Set compression targets: e.g., "must fit in 8GB VRAM, latency <200ms".
- Apply quantization first: GPTQ or AWQ for LLMs; TensorRT INT8 for vision models.
- If quality gap remains, apply structured pruning to remove redundant attention heads.
- Consider distillation if larger compression needed: train a student model 3-10× smaller.
- Final validation: measure compressed model on all target benchmarks; ensure quality degradation within acceptable bounds.
- Deploy: GGUF for CPU, ONNX for cross-platform, TensorRT for NVIDIA production.
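The steps above can be expressed as a gate-driven loop that escalates only when targets are missed. A skeleton sketch in which profile, meets_targets, quantize, prune_and_finetune, and distill are hypothetical stand-ins for the tools chosen above:
<syntaxhighlight lang="python">
def compress(base_model, targets):
    """Escalate from cheap to expensive compression until targets are met."""
    candidate = quantize(base_model)               # step 3: PTQ first (GPTQ/AWQ/INT8)
    if not meets_targets(profile(candidate), targets):
        candidate = prune_and_finetune(candidate)  # step 4: structured pruning
    if not meets_targets(profile(candidate), targets):
        candidate = distill(teacher=base_model)    # step 5: train a smaller student
    report = profile(candidate)                    # step 6: final validation
    assert meets_targets(report, targets), "relax targets or accept the quality gap"
    return candidate, report
</syntaxhighlight>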