Model Compression
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Model compression''' – Umbrella term for techniques that reduce a model's size and/or computational requirements.
* '''Quantization''' – Reducing the bit-width of model weights and/or activations (FP32 → INT8 → INT4 → INT2).
* '''Post-training quantization (PTQ)''' – Quantizing a trained model without further training; fast but may reduce accuracy.
* '''Quantization-aware training (QAT)''' – Simulating quantization during training, allowing the model to adapt; better accuracy than PTQ.
* '''Weight quantization''' – Quantizing model weights; does not require knowing activations at inference time.
* '''Activation quantization''' – Quantizing intermediate activations; requires calibration on representative data.
* '''Pruning''' – Removing unnecessary parameters (weights, neurons, attention heads, layers) from a trained network.
* '''Structured pruning''' – Removing entire neurons, filters, or attention heads; produces hardware-friendly sparse models.
* '''Unstructured pruning''' – Setting individual weights to zero without changing the tensor shape; high sparsity but limited hardware benefit without sparse compute support.
* '''Knowledge distillation''' – Training a smaller "student" model to mimic the behavior of a larger "teacher" model.
* '''LoRA (Low-Rank Adaptation)''' – Decomposes weight updates into low-rank matrices; used for efficient fine-tuning and as a form of compression.
* '''GPTQ''' – A post-training quantization method specifically for GPT-style LLMs; achieves INT4 with minimal quality loss.
* '''AWQ (Activation-aware Weight Quantization)''' – Quantization that preserves important weights identified by activation magnitude.
* '''bitsandbytes''' – A Python library providing efficient INT8 and INT4 quantization for transformer models.
</div>
<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
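The quantization entries above can be sketched concretely. This is a minimal, illustrative NumPy example of symmetric per-tensor INT8 weight quantization (one scale for the whole tensor, values mapped to the range −127..127); the function names and the per-tensor scaling choice are assumptions for illustration, not any specific library's API.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map FP32 weights to INT8 codes."""
    scale = np.abs(w).max() / 127.0              # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)  # stand-in for a weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error per weight is at most half the quantization step (scale / 2).
print("max abs error:", np.abs(w - w_hat).max())
```

Per-channel scales (one scale per output row) usually recover more accuracy than the single per-tensor scale shown here, at a small bookkeeping cost; PTQ methods such as GPTQ and AWQ refine this basic idea further.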
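Unstructured magnitude pruning, as described above, can be shown in a few lines: the smallest-magnitude weights are zeroed while the tensor shape stays unchanged. This is a hedged sketch; `magnitude_prune` and the tie-breaking behavior are illustrative assumptions, not a particular framework's pruning API.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights; shape is unchanged."""
    k = int(w.size * sparsity)                   # number of weights to remove
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]  # k-th smallest magnitude
    return np.where(np.abs(w) <= thresh, 0.0, w)

w = np.random.randn(128, 128)
p = magnitude_prune(w, sparsity=0.9)
print("sparsity:", (p == 0).mean())              # ~0.9
```

Note the glossary's caveat in action: `p` has the same dense shape as `w`, so the zeros save no compute unless the runtime exploits sparsity. Structured pruning instead deletes whole rows, filters, or heads, shrinking the tensor itself.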
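The knowledge-distillation entry can be made concrete with the standard soft-target loss: a temperature-softened KL divergence between teacher and student output distributions. This is a minimal NumPy sketch assuming logit-level distillation with temperature T; the function names and example logits are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)               # soft targets from the teacher
    q = softmax(student_logits, T)
    # T^2 factor keeps gradient magnitudes comparable across temperatures.
    return (p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T**2

teacher = np.array([[5.0, 1.0, -2.0]])
student_good = np.array([[4.0, 0.5, -1.5]])      # roughly matches the teacher
student_bad = np.array([[-2.0, 1.0, 5.0]])       # ranking reversed
print(distillation_loss(student_good, teacher) < distillation_loss(student_bad, teacher))  # True
```

In practice this soft-target term is combined with the ordinary cross-entropy on hard labels, weighted by a mixing coefficient.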
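The LoRA entry's idea of decomposing a weight update into low-rank matrices can be sketched as follows: the frozen weight W is left untouched, and a rank-r update B·A is added at forward time. Dimensions, the `alpha` scaling convention, and the zero-initialization of B are assumptions chosen to mirror common LoRA setups, not a specific library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8

W = rng.standard_normal((d_out, d_in))           # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01        # trainable low-rank factor
B = np.zeros((d_out, r))                         # zero init: update starts as a no-op

def lora_forward(x, alpha=16):
    """y = x (W + (alpha/r) B A)^T : frozen layer plus low-rank correction."""
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
# With B = 0, the adapted layer matches the frozen layer exactly.
print(np.allclose(lora_forward(x), x @ W.T))     # True
```

The compression angle: a full update to W would have d_out × d_in = 262,144 parameters, while A and B together have r × (d_in + d_out) = 8,192, and the learned update can be merged into W after training at no inference cost.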