Model Compression - Revision history

Wordpad: BloomWiki: Model Compression

2026-04-25T01:54:06Z

← Older revision		Revision as of 01:54, 25 April 2026
Line 1:		Line 1:
			<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
	{{BloomIntro}}		{{BloomIntro}}
	Model compression and quantization are techniques for reducing the size, memory footprint, and computational cost of neural networks while preserving as much of their predictive performance as possible. Large AI models like LLMs, vision transformers, and generative models require enormous memory and compute to run — making them impractical for edge devices, real-time applications, or cost-sensitive deployments. Compression enables AI capabilities to run on smartphones, embedded systems, and low-cost servers.		Model compression and quantization are techniques for reducing the size, memory footprint, and computational cost of neural networks while preserving as much of their predictive performance as possible. Large AI models like LLMs, vision transformers, and generative models require enormous memory and compute to run — making them impractical for edge devices, real-time applications, or cost-sensitive deployments. Compression enables AI capabilities to run on smartphones, embedded systems, and low-cost servers.
			</div>

	== Remembering ==		__TOC__

			<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Remembering</span> ==
	* '''Model compression''' — Umbrella term for techniques that reduce a model's size and/or computational requirements.		* '''Model compression''' — Umbrella term for techniques that reduce a model's size and/or computational requirements.
	* '''Quantization''' — Reducing the bit-width of model weights and/or activations (FP32 → INT8 → INT4 → INT2).		* '''Quantization''' — Reducing the bit-width of model weights and/or activations (FP32 → INT8 → INT4 → INT2).
Line 17:		Line 22:
	* '''AWQ (Activation-aware Weight Quantization)''' — Quantization that preserves important weights identified by activation magnitude.		* '''AWQ (Activation-aware Weight Quantization)''' — Quantization that preserves important weights identified by activation magnitude.
	* '''bitsandbytes''' — A Python library providing efficient INT8 and INT4 quantization for transformer models.		* '''bitsandbytes''' — A Python library providing efficient INT8 and INT4 quantization for transformer models.
			</div>

	== Understanding ==		<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Understanding</span> ==
	Neural network weights are typically stored as 32-bit floating point numbers (FP32). Each weight takes 4 bytes. A 7-billion parameter model requires ~28GB in FP32, too large for a single consumer GPU. Quantization to INT8 cuts this by 4× to ~7GB; INT4 reduces it to ~3.5GB. This makes the difference between "runs only on expensive server hardware" and "runs on a gaming laptop."		Neural network weights are typically stored as 32-bit floating point numbers (FP32). Each weight takes 4 bytes. A 7-billion parameter model requires ~28GB in FP32, too large for a single consumer GPU. Quantization to INT8 cuts this by 4× to ~7GB; INT4 reduces it to ~3.5GB. This makes the difference between "runs only on expensive server hardware" and "runs on a gaming laptop."

Line 26:		Line 33:

	'''Layer sensitivity analysis''': Not all layers are equally sensitive to quantization. Attention layers and first/last layers are typically more sensitive. Mixed-precision quantization keeps sensitive layers at FP16/FP32 and aggressively quantizes others at INT4.		'''Layer sensitivity analysis''': Not all layers are equally sensitive to quantization. Attention layers and first/last layers are typically more sensitive. Mixed-precision quantization keeps sensitive layers at FP16/FP32 and aggressively quantizes others at INT4.
			</div>

	== Applying ==		<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Applying</span> ==
	'''LLM quantization with bitsandbytes:'''		'''LLM quantization with bitsandbytes:'''
	<syntaxhighlight lang="python">		<syntaxhighlight lang="python">
Line 72:		Line 81:
	: '''Creating a smaller model''' → Knowledge distillation (teacher→student)		: '''Creating a smaller model''' → Knowledge distillation (teacher→student)
	: '''Redundant model structure''' → Structured pruning (remove attention heads, layers)		: '''Redundant model structure''' → Structured pruning (remove attention heads, layers)
			</div>

	== Analyzing ==		<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Analyzing</span> ==
	{| class="wikitable"		{| class="wikitable"
	|+ Compression Method Comparison		|+ Compression Method Comparison
Line 90:		Line 101:

	'''Failure modes''': Catastrophic quantization — certain layers or operations are extremely sensitive; naive INT4 quantization collapses model quality. Calibration set mismatch — PTQ calibrated on wrong data distribution produces poor results on target domain. Student-teacher mismatch — if student capacity is too small to approximate teacher, distillation adds noise not signal. Speedup-accuracy assumptions — INT8 is only faster if hardware has INT8 compute units (most server GPUs and NPUs do; some don't).		'''Failure modes''': Catastrophic quantization — certain layers or operations are extremely sensitive; naive INT4 quantization collapses model quality. Calibration set mismatch — PTQ calibrated on wrong data distribution produces poor results on target domain. Student-teacher mismatch — if student capacity is too small to approximate teacher, distillation adds noise not signal. Speedup-accuracy assumptions — INT8 is only faster if hardware has INT8 compute units (most server GPUs and NPUs do; some don't).
			</div>

	== Evaluating ==		<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Evaluating</span> ==
	Evaluation must cover:		Evaluation must cover:
	# '''Quality preservation''': measure perplexity for LLMs, accuracy for classifiers — compare quantized vs. FP32 baseline.		# '''Quality preservation''': measure perplexity for LLMs, accuracy for classifiers — compare quantized vs. FP32 baseline.
Line 97:		Line 110:
	# '''Memory footprint''': measure peak VRAM usage during inference.		# '''Memory footprint''': measure peak VRAM usage during inference.
	# '''Task-specific degradation''': some tasks (reasoning, math) are more sensitive to quantization than others (summarization, translation). Expert practitioners evaluate on at least 3 representative downstream tasks, not just perplexity.		# '''Task-specific degradation''': some tasks (reasoning, math) are more sensitive to quantization than others (summarization, translation). Expert practitioners evaluate on at least 3 representative downstream tasks, not just perplexity.
			</div>

	== Creating ==		<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Creating</span> ==
	Designing a model compression pipeline:		Designing a model compression pipeline:
	# Profile the base model: measure latency, memory, and quality on target hardware.		# Profile the base model: measure latency, memory, and quality on target hardware.
Line 111:		Line 126:
	[[Category:Deep Learning]]		[[Category:Deep Learning]]
	[[Category:AI Infrastructure]]		[[Category:AI Infrastructure]]
			</div>

Wordpad: BloomWiki: Model Compression

2026-04-23T14:35:35Z

← Older revision		Revision as of 14:35, 23 April 2026
Line 92:		Line 92:

	== Evaluating ==		== Evaluating ==
	Evaluation must cover: ~~(1)~~ '''Quality preservation''': measure perplexity for LLMs, accuracy for classifiers — compare quantized vs. FP32 baseline. ~~(2)~~ '''Latency and throughput''': measure tokens/second or ms/inference on target hardware. ~~(3)~~ '''Memory footprint''': measure peak VRAM usage during inference. ~~(4)~~ '''Task-specific degradation''': some tasks (reasoning, math) are more sensitive to quantization than others (summarization, translation). Expert practitioners evaluate on at least 3 representative downstream tasks, not just perplexity.		Evaluation must cover:
			# '''Quality preservation''': measure perplexity for LLMs, accuracy for classifiers — compare quantized vs. FP32 baseline.
			# '''Latency and throughput''': measure tokens/second or ms/inference on target hardware.
			# '''Memory footprint''': measure peak VRAM usage during inference.
			# '''Task-specific degradation''': some tasks (reasoning, math) are more sensitive to quantization than others (summarization, translation). Expert practitioners evaluate on at least 3 representative downstream tasks, not just perplexity.

	== Creating ==		== Creating ==
	Designing a model compression pipeline: ~~(1)~~ Profile the base model: measure latency, memory, and quality on target hardware. ~~(2)~~ Set compression targets: e.g., "must fit in 8GB VRAM, latency <200ms". ~~(3)~~ Apply quantization first: GPTQ or AWQ for LLMs; TensorRT INT8 for vision models. ~~(4)~~ If quality gap remains, apply structured pruning to remove redundant attention heads. ~~(5)~~ Consider distillation if larger compression needed: train a student model 3-10× smaller. ~~(6)~~ Final validation: measure compressed model on all target benchmarks; ensure quality degradation within acceptable bounds. ~~(7)~~ Deploy: GGUF for CPU, ONNX for cross-platform, TensorRT for NVIDIA production.		Designing a model compression pipeline:
			# Profile the base model: measure latency, memory, and quality on target hardware.
			# Set compression targets: e.g., "must fit in 8GB VRAM, latency <200ms".
			# Apply quantization first: GPTQ or AWQ for LLMs; TensorRT INT8 for vision models.
			# If quality gap remains, apply structured pruning to remove redundant attention heads.
			# Consider distillation if larger compression needed: train a student model 3-10× smaller.
			# Final validation: measure compressed model on all target benchmarks; ensure quality degradation within acceptable bounds.
			# Deploy: GGUF for CPU, ONNX for cross-platform, TensorRT for NVIDIA production.

	[[Category:Artificial Intelligence]]		[[Category:Artificial Intelligence]]
	[[Category:Deep Learning]]		[[Category:Deep Learning]]
	[[Category:AI Infrastructure]]		[[Category:AI Infrastructure]]

Wordpad: BloomWiki: Model Compression

2026-04-23T14:20:10Z

New page

{{BloomIntro}}
Model compression and quantization are techniques for reducing the size, memory footprint, and computational cost of neural networks while preserving as much of their predictive performance as possible. Large AI models like LLMs, vision transformers, and generative models require enormous memory and compute to run — making them impractical for edge devices, real-time applications, or cost-sensitive deployments. Compression enables AI capabilities to run on smartphones, embedded systems, and low-cost servers.

== Remembering ==
* '''Model compression''' — Umbrella term for techniques that reduce a model's size and/or computational requirements.
* '''Quantization''' — Reducing the bit-width of model weights and/or activations (FP32 → INT8 → INT4 → INT2).
* '''Post-training quantization (PTQ)''' — Quantizing a trained model without further training; fast but may reduce accuracy.
* '''Quantization-aware training (QAT)''' — Simulating quantization during training, allowing the model to adapt; better accuracy than PTQ.
* '''Weight quantization''' — Quantizing model weights; does not require knowing activations at inference time.
* '''Activation quantization''' — Quantizing intermediate activations; requires calibration on representative data.
* '''Pruning''' — Removing unnecessary parameters (weights, neurons, attention heads, layers) from a trained network.
* '''Structured pruning''' — Removing entire neurons, filters, or attention heads; produces smaller dense tensors that run efficiently on standard hardware.
* '''Unstructured pruning''' — Setting individual weights to zero without changing the tensor shape; high sparsity but limited hardware benefit without sparse compute support.
* '''Knowledge distillation''' — Training a smaller "student" model to mimic the behavior of a larger "teacher" model.
* '''LoRA (Low-Rank Adaptation)''' — Decomposes weight updates into low-rank matrices; used for efficient fine-tuning. It compresses the stored update (the adapter), not the base model itself.
* '''GPTQ''' — A post-training quantization method specifically for GPT-style LLMs; achieves INT4 with minimal quality loss.
* '''AWQ (Activation-aware Weight Quantization)''' — Quantization that preserves important weights identified by activation magnitude.
* '''bitsandbytes''' — A Python library providing efficient INT8 and INT4 quantization for transformer models.
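
The two pruning entries above differ mainly in what happens to the tensor shape. The following is a minimal pure-Python sketch of that difference, not a production pruning routine. The function names, the L1-norm ranking, and the example matrix <code>W</code> are assumptions made for illustration.

```python
def unstructured_prune(matrix, sparsity):
    """Zero the smallest-magnitude individual weights; the shape is unchanged."""
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(len(flat) * sparsity)          # number of weights to zero
    if k == 0:
        return [row[:] for row in matrix]
    cutoff = flat[k - 1]                   # ties at the cutoff are zeroed as well
    return [[0.0 if abs(w) <= cutoff else w for w in row] for row in matrix]


def structured_prune(matrix, keep_rows):
    """Drop whole rows (e.g. neurons) with the smallest L1 norm; the matrix shrinks."""
    ranked = sorted(
        range(len(matrix)),
        key=lambda i: sum(abs(w) for w in matrix[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep_rows])      # preserve the original row order
    return [matrix[i][:] for i in kept]


# Made-up 3x3 weight matrix; each row stands for one neuron.
W = [
    [0.9, -0.1, 0.4],
    [0.05, 0.02, -0.03],
    [-0.7, 0.6, 0.2],
]
print(unstructured_prune(W, 0.5))   # still 3x3, small entries set to zero
print(structured_prune(W, 2))       # 2x3, the weakest row is removed
```

The unstructured result keeps its 3x3 shape, so it only saves compute on hardware or kernels that exploit sparsity. The structured result is a smaller dense matrix that any matrix-multiply routine handles directly.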

== Understanding ==
Neural network weights are typically stored as 32-bit floating point numbers (FP32), so each weight takes 4 bytes. A 7-billion parameter model therefore requires ~28GB in FP32, too large for a single consumer GPU. Quantization to INT8 cuts this by 4× to ~7GB; INT4 cuts it by 8× to ~3.5GB. This makes the difference between "runs only on expensive server hardware" and "runs on a gaming laptop."
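
The arithmetic behind these figures can be checked with a few lines of Python. This is a rough sketch that counts weight storage only; the helper name <code>weight_memory_gb</code> is an assumption for illustration, and activations, the KV cache, and quantization metadata (scales, zero points) are ignored.

```python
# Approximate weight-only memory for a model at different bit-widths.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

For a 7-billion parameter model this gives 28.0, 14.0, 7.0, and 3.5 GB respectively. Real memory use at inference time is higher because of the items left out above.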

'''The quality vs. size trade-off''': Quantization is lossy — replacing precise floating point values with lower-precision integers introduces rounding errors. The key insight from research is that transformer models are surprisingly robust to aggressive quantization of weights (INT4–INT8), especially when carefully calibrated. Activations are more sensitive and often kept at higher precision.
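
To make the rounding error concrete, here is a minimal sketch of symmetric (absmax) quantization with a single shared scale. It is one simple scheme among many, not the method used by GPTQ, AWQ, or bitsandbytes; the function names and the example weights are made up for illustration.

```python
def quantize_absmax(values, bits=8):
    """Symmetric quantization: map floats to signed integers with one shared scale."""
    qmax = 2 ** (bits - 1) - 1             # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate float values."""
    return [x * scale for x in q]


weights = [0.42, -1.30, 0.07, 0.91, -0.55]  # made-up example weights
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"INT{bits}: codes={q}, max abs error={max_err:.4f}")
```

With round-to-nearest, the error per value is bounded by half the scale, so the coarser INT4 grid produces a visibly larger error than INT8 on the same weights.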

'''Knowledge distillation''': A fundamentally different approach — train a small student model to reproduce the outputs of a large teacher, not just the hard labels. The teacher provides "soft labels" (probability distributions over classes) that contain more information than one-hot labels. Hinton's original insight: a model trained on "90% cat, 10% dog" learns richer representations than one trained on "100% cat."
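
The effect of softening the teacher's output can be seen with a temperature-scaled softmax. This is a small sketch using only the standard library; the logits and the class names in the comment are hypothetical.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T (T > 1 flattens the distribution)."""
    scaled = [z / T for z in logits]
    m = max(scaled)                         # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [6.0, 3.0, -1.0]           # hypothetical scores for cat, dog, car
for T in (1.0, 4.0):
    probs = softmax_with_temperature(teacher_logits, T)
    print(f"T={T}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At T=1 almost all probability mass sits on the top class. At T=4 the ranking is unchanged, but the runner-up classes receive noticeably more mass, which is the extra signal the student learns from.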

'''Layer sensitivity analysis''': Not all layers are equally sensitive to quantization. Attention layers and first/last layers are typically more sensitive. Mixed-precision quantization keeps sensitive layers at FP16/FP32 and aggressively quantizes others at INT4.
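
One simple way to turn a sensitivity analysis into a mixed-precision plan is a threshold rule. The sketch below assumes each layer has already been scored by quantizing it alone and measuring the quality drop; the layer names, scores, and threshold are hypothetical.

```python
def assign_bit_widths(sensitivity, threshold, low_bits=4, high_bits=16):
    """Keep layers whose measured quality drop exceeds the threshold at high precision."""
    return {
        layer: high_bits if drop > threshold else low_bits
        for layer, drop in sensitivity.items()
    }


# Hypothetical perplexity increase when only that layer is quantized to 4-bit.
sensitivity = {
    "embed_tokens": 0.40,
    "layers.0.self_attn": 0.22,
    "layers.0.mlp": 0.03,
    "layers.1.self_attn": 0.08,
    "layers.1.mlp": 0.02,
    "lm_head": 0.35,
}
plan = assign_bit_widths(sensitivity, threshold=0.10)
for layer, bits in plan.items():
    print(f"{layer}: {bits}-bit")
```

A threshold rule treats layers independently, which is an approximation: errors from several quantized layers can compound, so the final plan still needs an end-to-end quality check.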

== Applying ==
'''LLM quantization with bitsandbytes:'''
<syntaxhighlight lang="python">
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantization (NF4) with double quantization: about 0.5 bytes per quantized weight
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16 despite 4-bit storage
    bnb_4bit_use_double_quant=True,         # Quantize the quantization constants too
    bnb_4bit_quant_type="nf4",              # NF4: optimized for normally distributed weights
)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Send inputs to wherever device_map placed the model instead of hard-coding "cuda"
inputs = tokenizer("What is quantum computing?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
</syntaxhighlight>

'''Knowledge distillation:'''
<syntaxhighlight lang="python">
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, true_labels, T=4.0, alpha=0.7):
    """Combined distillation + task loss."""
    # Soft target loss: KL(teacher_soft || student_soft) at temperature T
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (T ** 2)
    # Hard target loss: standard cross-entropy
    task = F.cross_entropy(student_logits, true_labels)
    return alpha * distill + (1 - alpha) * task
</syntaxhighlight>

; Compression technique selection
: '''LLM, limited VRAM''' → GPTQ (INT4), AWQ, GGUF (llama.cpp)
: '''Fine-tuning large model''' → LoRA + QLoRA (4-bit base + LoRA adapters)
: '''Edge deployment (vision)''' → INT8 PTQ with TensorRT or ONNX Runtime
: '''Creating a smaller model''' → Knowledge distillation (teacher→student)
: '''Redundant model structure''' → Structured pruning (remove attention heads, layers)

== Analyzing ==
{| class="wikitable"
|+ Compression Method Comparison
! Method !! Quality Loss !! Speed Gain !! Size Reduction !! Retraining Needed
|-
| FP32 → INT8 PTQ || Low (<1%) || 2-4× || 4× || No
|-
| FP32 → INT4 (GPTQ/AWQ) || Low-moderate || 4-6× || 8× || No
|-
| Knowledge distillation || Moderate || Model-dependent || 10-100× || Yes (student)
|-
| Structured pruning || Moderate || 2-4× || 2-10× || Yes (fine-tune)
|-
| LoRA fine-tuning || None (for fine-tuning) || None (inference same) || ~0.5% extra || No
|}

'''Failure modes''':
* '''Catastrophic quantization''': certain layers or operations are extremely sensitive, and naive INT4 quantization can collapse model quality.
* '''Calibration set mismatch''': PTQ calibrated on the wrong data distribution produces poor results on the target domain.
* '''Student-teacher mismatch''': if the student's capacity is too small to approximate the teacher, distillation adds noise rather than signal.
* '''Speedup-accuracy assumptions''': INT8 is only faster if the hardware has INT8 compute units (most server GPUs and NPUs do; some don't).
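
Calibration set mismatch is easy to reproduce in miniature. The sketch below picks an INT8 activation scale from a calibration sample and then applies it to values from a wider range. The max-based calibration rule, the function names, and both sample lists are assumptions chosen for illustration.

```python
def calibrate_scale(samples, bits=8):
    """Pick an activation scale from the largest magnitude seen during calibration."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(x) for x in samples) / qmax


def fake_quantize(x, scale, bits=8):
    """Quantize then dequantize a value, clipping to the representable range."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


calibration_acts = [0.2, -0.8, 1.1, 0.5, -1.0]   # hypothetical calibration domain
target_acts = [0.3, 4.2, -3.7, 0.9]              # hypothetical target domain, wider range

scale = calibrate_scale(calibration_acts)
for x in target_acts:
    print(f"{x:+.2f} -> {fake_quantize(x, scale):+.2f}")
```

Values inside the calibrated range survive with a small rounding error, while 4.2 and -3.7 are clipped to about ±1.1. That clipping is the error that appears when the calibration data does not cover the target domain.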

== Evaluating ==
Evaluation must cover:
# '''Quality preservation''': measure perplexity for LLMs and accuracy for classifiers; compare the quantized model against the FP32 baseline.
# '''Latency and throughput''': measure tokens/second or ms/inference on target hardware.
# '''Memory footprint''': measure peak VRAM usage during inference.
# '''Task-specific degradation''': some tasks (reasoning, math) are more sensitive to quantization than others (summarization, translation). Expert practitioners evaluate on at least 3 representative downstream tasks, not just perplexity.
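
The first two measurements can be sketched with the standard library alone. The helpers below compute perplexity from per-token log-probabilities and average wall-clock latency; the function names and the log-probability values are hypothetical, and the timed workload is a stand-in for a real inference call.

```python
import math
import time


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(baseline, compressed):
    """Fractional change of a metric relative to the baseline."""
    return (compressed - baseline) / baseline


def mean_latency_ms(fn, warmup=2, runs=10):
    """Average wall-clock time of fn() in milliseconds after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000.0


# Hypothetical per-token log-probabilities for the same text under both models.
baseline_lp = [-1.9, -2.1, -1.7, -2.3]
quantized_lp = [-2.0, -2.2, -1.8, -2.4]

ppl_base = perplexity(baseline_lp)
ppl_quant = perplexity(quantized_lp)
print(f"perplexity: {ppl_base:.2f} -> {ppl_quant:.2f} "
      f"({relative_increase(ppl_base, ppl_quant):+.1%})")
print(f"latency: {mean_latency_ms(lambda: sum(range(10_000))):.3f} ms")  # stand-in workload
```

When timing GPU inference, the device must be synchronized before the clock is read, otherwise the measurement only covers kernel launch time.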

== Creating ==
Designing a model compression pipeline:
# Profile the base model: measure latency, memory, and quality on target hardware.
# Set compression targets: e.g., "must fit in 8GB VRAM, latency <200ms".
# Apply quantization first: GPTQ or AWQ for LLMs; TensorRT INT8 for vision models.
# If the size or latency target is still not met, apply structured pruning to remove redundant attention heads.
# Consider distillation if larger compression is needed: train a student model 3-10× smaller.
# Final validation: measure the compressed model on all target benchmarks; ensure quality degradation stays within acceptable bounds.
# Deploy: GGUF for CPU, ONNX for cross-platform, TensorRT for NVIDIA production.
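
The target-setting and validation steps can be expressed as a small acceptance check. This is a sketch under stated assumptions: the <code>Targets</code> and <code>Measurement</code> classes, the 2% quality budget, and every measurement value are hypothetical, and real numbers would come from profiling on the target hardware.

```python
from dataclasses import dataclass


@dataclass
class Targets:
    max_vram_gb: float
    max_latency_ms: float
    max_quality_drop: float   # allowed relative drop versus the baseline


@dataclass
class Measurement:
    name: str
    vram_gb: float
    latency_ms: float
    quality_drop: float


def meets_targets(m: Measurement, t: Targets) -> bool:
    """True only if memory, latency, and quality are all within the targets."""
    return (m.vram_gb <= t.max_vram_gb
            and m.latency_ms < t.max_latency_ms
            and m.quality_drop <= t.max_quality_drop)


def first_passing(candidates, targets):
    """Return the first candidate, in the given order, that meets every target."""
    for m in candidates:
        if meets_targets(m, targets):
            return m
    return None


targets = Targets(max_vram_gb=8.0, max_latency_ms=200.0, max_quality_drop=0.02)
# Hypothetical measurements, ordered from least to most invasive step.
candidates = [
    Measurement("baseline FP16", 14.5, 310.0, 0.000),
    Measurement("INT8 PTQ", 8.6, 220.0, 0.004),
    Measurement("INT4 AWQ", 5.1, 150.0, 0.015),
    Measurement("INT4 + head pruning", 4.6, 130.0, 0.031),
]
choice = first_passing(candidates, targets)
print(choice.name if choice else "no candidate meets the targets")
```

Ordering candidates from least to most invasive means the check stops at the mildest configuration that satisfies every target. With the made-up numbers above that is the INT4 AWQ entry, and the pruned variant is rejected for exceeding the quality budget.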

[[Category:Artificial Intelligence]]
[[Category:Deep Learning]]
[[Category:AI Infrastructure]]