== Understanding ==

Neural network weights are typically stored as 32-bit floating point numbers (FP32), so each weight takes 4 bytes. A 7-billion-parameter model therefore requires ~28 GB in FP32, too large for a single consumer GPU. Quantizing to INT8 halves this to ~7 GB; INT4 reduces it to ~3.5 GB. That is the difference between "runs only on expensive server hardware" and "runs on a gaming laptop" (see the first sketch below).

'''The quality vs. size trade-off''': Quantization is lossy: replacing precise floating point values with lower-precision integers introduces rounding error. A key empirical finding is that transformer models are surprisingly robust to aggressive quantization of their weights (INT4–INT8), especially when carefully calibrated. Activations are more sensitive and are often kept at higher precision.

'''Knowledge distillation''': A fundamentally different approach. Train a small student model to reproduce the outputs of a large teacher, not just the hard labels. The teacher provides "soft labels" (probability distributions over classes) that carry more information than one-hot labels. Hinton's original insight: a model trained on "90% cat, 10% dog" learns richer representations than one trained on "100% cat."

'''Layer sensitivity analysis''': Not all layers are equally sensitive to quantization. Attention layers and the first and last layers are typically the most sensitive. Mixed-precision quantization keeps sensitive layers at FP16/FP32 and aggressively quantizes the rest to INT4.
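To make the memory arithmetic concrete, here is a minimal Python sketch. The bytes-per-parameter table and the 7e9 parameter count simply restate the figures above; real deployments also need memory for activations and runtime overhead, which this ignores.

<syntaxhighlight lang="python">
# Back-of-the-envelope weight storage for a 7B-parameter model.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_size_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage only; activations and overhead excluded."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "int8", "int4"):
    print(f"7B model in {dtype}: ~{model_size_gb(7e9, dtype):.1f} GB")
# fp32 ~28.0 GB, int8 ~7.0 GB, int4 ~3.5 GB
</syntaxhighlight>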
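The rounding error behind the quality trade-off is easy to see with symmetric per-tensor INT8 quantization, one common calibration scheme among several. This NumPy sketch is illustrative, not any particular library's implementation:

<syntaxhighlight lang="python">
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8: one scale maps max |w| to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight tensor
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"mean rounding error: {err:.2e} (scale={scale:.2e})")
</syntaxhighlight>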
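A common way to express the distillation objective, in the spirit of Hinton's formulation, blends a temperature-softened KL term on the teacher's soft labels with ordinary cross-entropy on the hard labels. The temperature and alpha values in this PyTorch sketch are illustrative assumptions, not canonical settings:

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-label KL divergence with hard-label cross-entropy."""
    # Softened distributions carry inter-class detail ("90% cat, 10% dog").
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps the soft term's gradient scale comparable.
    soft = F.kl_div(log_student, soft_targets,
                    reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 8 examples, 10 classes.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
</syntaxhighlight>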
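A mixed-precision policy can be as simple as a lookup built from a sensitivity sweep: quantize one layer at a time, measure the accuracy drop, then keep the fragile layers at higher precision. The layer names, threshold, and drop numbers in this sketch are hypothetical:

<syntaxhighlight lang="python">
def assign_precision(layer_names, sensitivity, threshold=0.01):
    """Toy policy: sensitive layers stay FP16, everything else goes INT4.

    `sensitivity` maps layer name -> accuracy drop observed when that
    layer alone was quantized (a simple layer sensitivity analysis).
    """
    plan = {}
    for name in layer_names:
        keep_high = (
            sensitivity[name] > threshold                  # fragile layers
            or "attn" in name                              # attention blocks
            or name in (layer_names[0], layer_names[-1])   # first/last layers
        )
        plan[name] = "fp16" if keep_high else "int4"
    return plan

layers = ["embed", "attn.0", "mlp.0", "mlp.1", "lm_head"]
drops = {"embed": 0.020, "attn.0": 0.015,
         "mlp.0": 0.002, "mlp.1": 0.003, "lm_head": 0.030}
print(assign_precision(layers, drops))
</syntaxhighlight>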