<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Model compression and quantization are techniques for reducing the size, memory footprint, and computational cost of neural networks while preserving as much of their predictive performance as possible. Large AI models such as LLMs, vision transformers, and generative models require enormous memory and compute to run, making them impractical for edge devices, real-time applications, or cost-sensitive deployments. Compression enables AI capabilities to run on smartphones, embedded systems, and low-cost servers.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Model compression''' – Umbrella term for techniques that reduce a model's size and/or computational requirements.
* '''Quantization''' – Reducing the bit-width of model weights and/or activations (FP32 → INT8 → INT4 → INT2).
* '''Post-training quantization (PTQ)''' – Quantizing a trained model without further training; fast, but may reduce accuracy.
* '''Quantization-aware training (QAT)''' – Simulating quantization during training so the model can adapt; better accuracy than PTQ.
* '''Weight quantization''' – Quantizing model weights; does not require knowing activations at inference time.
* '''Activation quantization''' – Quantizing intermediate activations; requires calibration on representative data.
* '''Pruning''' – Removing unnecessary parameters (weights, neurons, attention heads, layers) from a trained network.
* '''Structured pruning''' – Removing entire neurons, filters, or attention heads; produces smaller dense models that run efficiently on standard hardware.
* '''Unstructured pruning''' – Setting individual weights to zero without changing the tensor shape; high sparsity, but limited hardware benefit without sparse compute support.
* '''Knowledge distillation''' – Training a smaller "student" model to mimic the behavior of a larger "teacher" model.
* '''LoRA (Low-Rank Adaptation)''' – Decomposes weight updates into low-rank matrices; used for efficient fine-tuning and as a form of compression.
* '''GPTQ''' – A post-training quantization method designed for GPT-style LLMs; achieves INT4 with minimal quality loss.
* '''AWQ (Activation-aware Weight Quantization)''' – Quantization that preserves important weights identified by activation magnitude.
* '''bitsandbytes''' – A Python library providing efficient INT8 and INT4 quantization for transformer models.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Neural network weights are typically stored as 32-bit floating-point numbers (FP32), so each weight takes 4 bytes. A 7-billion-parameter model therefore requires ~28 GB in FP32 – too large for a single consumer GPU. Quantization to INT8 (1 byte per weight) cuts this to ~7 GB; INT4 reduces it to ~3.5 GB. This makes the difference between "runs only on expensive server hardware" and "runs on a gaming laptop."

'''The quality vs. size trade-off''': Quantization is lossy; replacing precise floating-point values with lower-precision integers introduces rounding error. The key insight from research is that transformer models are surprisingly robust to aggressive quantization of their weights (down to INT8 and even INT4), especially when carefully calibrated. Activations are more sensitive and are often kept at higher precision.

'''Knowledge distillation''': A fundamentally different approach: train a small student model to reproduce the outputs of a large teacher, not just the hard labels. The teacher provides "soft labels" (probability distributions over classes) that contain more information than one-hot labels.
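The extra information carried by soft labels is easy to see numerically. Below is a minimal sketch of a temperature-scaled softmax; the class names and logit values are invented purely for illustration:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Turn raw logits into a probability distribution, softened by temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [cat, dog, car]
logits = [4.0, 2.0, -1.0]
print(softmax_with_temperature(logits, T=1.0))  # peaked: ~[0.88, 0.12, 0.01]
print(softmax_with_temperature(logits, T=4.0))  # softened: ~[0.53, 0.32, 0.15]
```

At higher temperature, the ranking "a dog is more cat-like than a car" becomes visible to the student; that ranking is exactly what a one-hot label discards.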
Hinton's original insight: a model trained on "90% cat, 10% dog" learns richer representations than one trained on "100% cat."

'''Layer sensitivity analysis''': Not all layers are equally sensitive to quantization. Attention layers and the first and last layers are typically the most sensitive. Mixed-precision quantization keeps sensitive layers at FP16/FP32 and aggressively quantizes the rest to INT4.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''LLM quantization with bitsandbytes:'''
<syntaxhighlight lang="python">
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantization (NF4) with double quantization – ~4 GB for a 7B model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16 despite 4-bit storage
    bnb_4bit_use_double_quant=True,         # Quantize the quantization constants too
    bnb_4bit_quant_type="nf4",              # NF4: optimized for normally distributed weights
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

inputs = tokenizer("What is quantum computing?", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0]))
</syntaxhighlight>

'''Knowledge distillation:'''
<syntaxhighlight lang="python">
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, T=4.0, alpha=0.7):
    """Combined distillation + task loss."""
    # Soft target loss: KL(teacher_soft || student_soft) at temperature T
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (T ** 2)
    # Hard target loss: standard cross-entropy
    task = F.cross_entropy(student_logits, true_labels)
    return alpha * distill + (1 - alpha) * task
</syntaxhighlight>

; Compression technique selection
: '''LLM, limited VRAM''' → GPTQ (INT4), AWQ, GGUF (llama.cpp)
: '''Fine-tuning a large model''' → LoRA or QLoRA (4-bit base + LoRA adapters)
: '''Edge deployment (vision)''' → INT8 PTQ with TensorRT or ONNX Runtime
: '''Creating a smaller model''' → Knowledge distillation (teacher → student)
: '''Redundant model structure''' → Structured pruning (remove attention heads, layers)
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Compression Method Comparison
! Method !! Quality Loss !! Speed Gain !! Size Reduction !! Retraining Needed
|-
| FP32 → INT8 PTQ || Low (<1%) || 2–4× || 4× || No
|-
| FP32 → INT4 (GPTQ/AWQ) || Low–moderate || 4–6× || 8× || No
|-
| Knowledge distillation || Moderate || Model-dependent || 10–100× || Yes (student)
|-
| Structured pruning || Moderate || 2–4× || 2–10× || Yes (fine-tune)
|-
| LoRA fine-tuning || None (for fine-tuning) || None (inference unchanged) || ~0.5% extra parameters || No (adapters only)
|}

'''Failure modes''':
* '''Catastrophic quantization''' – certain layers or operations are extremely sensitive; naive INT4 quantization collapses model quality.
* '''Calibration set mismatch''' – PTQ calibrated on the wrong data distribution produces poor results on the target domain.
* '''Student–teacher mismatch''' – if the student's capacity is too small to approximate the teacher, distillation adds noise, not signal.
* '''Speedup assumptions''' – INT8 is only faster when the hardware has INT8 compute units (most server GPUs and NPUs do; some don't).
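The rounding error that quantization introduces can be made concrete with a minimal symmetric per-tensor INT8 quantizer. This is a NumPy sketch, not any particular library's implementation; production PTQ toolchains typically use per-channel scales and calibration data:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)  # weight-like values
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())
print(f"scale={scale:.6f}, worst-case rounding error={max_err:.6f}")  # error <= scale/2
```

Storage drops 4× (one byte per weight plus a single FP32 scale per tensor), and the worst-case error is half the quantization step. Note that a single outlier weight inflates `scale` and degrades the precision of every other weight in the tensor, which is the effect activation-aware methods like AWQ work around.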
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Evaluation must cover:
# '''Quality preservation''': measure perplexity for LLMs or accuracy for classifiers, comparing the quantized model against the FP32 baseline.
# '''Latency and throughput''': measure tokens/second or ms/inference on the target hardware.
# '''Memory footprint''': measure peak VRAM usage during inference.
# '''Task-specific degradation''': some tasks (reasoning, math) are more sensitive to quantization than others (summarization, translation).
Expert practitioners evaluate on at least three representative downstream tasks, not just perplexity.
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a model compression pipeline:
# Profile the base model: measure latency, memory, and quality on the target hardware.
# Set compression targets: e.g., "must fit in 8 GB VRAM, latency < 200 ms".
# Apply quantization first: GPTQ or AWQ for LLMs; TensorRT INT8 for vision models.
# If the size or latency target is still not met, apply structured pruning to remove redundant attention heads.
# Consider distillation if larger compression is needed: train a student model 3–10× smaller.
# Final validation: measure the compressed model on all target benchmarks; ensure quality degradation stays within acceptable bounds.
# Deploy: GGUF for CPU, ONNX for cross-platform, TensorRT for NVIDIA production.
</div>

[[Category:Artificial Intelligence]]
[[Category:Deep Learning]]
[[Category:AI Infrastructure]]
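The latency and throughput measurement from the Evaluating section can be sketched in plain Python. Here `generate_fn` is a hypothetical stand-in for whatever produces tokens (e.g. a wrapped `model.generate` call); the timing logic is the point:

```python
import time

def benchmark_generation(generate_fn, n_runs=3):
    """Return (tokens_per_second, avg_seconds_per_run) averaged over n_runs calls."""
    total_tokens, total_time = 0, 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate_fn()  # must return the generated token ids
        total_time += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_time, total_time / n_runs

# On CUDA, call torch.cuda.reset_peak_memory_stats() before and
# torch.cuda.max_memory_allocated() after to capture the peak-VRAM number as well.
```

Run the same prompts through the FP32 baseline and the quantized model on identical hardware; comparing the two numbers is what makes a speed-gain claim meaningful.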