Model Compression
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Model compression''' – Umbrella term for techniques that reduce a model's size and/or computational requirements.
* '''Quantization''' – Reducing the bit-width of model weights and/or activations (FP32 → INT8 → INT4 → INT2).
* '''Post-training quantization (PTQ)''' – Quantizing a trained model without further training; fast but may reduce accuracy.
* '''Quantization-aware training (QAT)''' – Simulating quantization during training, allowing the model to adapt; better accuracy than PTQ.
* '''Weight quantization''' – Quantizing model weights; does not require knowing activations at inference time.
* '''Activation quantization''' – Quantizing intermediate activations; requires calibration on representative data.
* '''Pruning''' – Removing unnecessary parameters (weights, neurons, attention heads, layers) from a trained network.
* '''Structured pruning''' – Removing entire neurons, filters, or attention heads; produces hardware-friendly sparse models.
* '''Unstructured pruning''' – Setting individual weights to zero without changing the tensor shape; high sparsity but limited hardware benefit without sparse compute support.
* '''Knowledge distillation''' – Training a smaller "student" model to mimic the behavior of a larger "teacher" model.
* '''LoRA (Low-Rank Adaptation)''' – Decomposes weight updates into low-rank matrices; used for efficient fine-tuning and as a form of compression.
* '''GPTQ''' – A post-training quantization method specifically for GPT-style LLMs; achieves INT4 with minimal quality loss.
* '''AWQ (Activation-aware Weight Quantization)''' – Quantization that preserves important weights identified by activation magnitude.
* '''bitsandbytes''' – A Python library providing efficient INT8 and INT4 quantization for transformer models.
</div>
<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
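The quantization entries above can be sketched concretely. This is a minimal, illustrative NumPy example of symmetric per-tensor INT8 weight quantization (one scale for the whole tensor, values mapped to the range −127..127); the function names and the per-tensor scaling choice are assumptions for illustration, not any specific library's API.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map FP32 weights to INT8 codes."""
    scale = np.abs(w).max() / 127.0              # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)  # stand-in for a weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error per weight is at most half the quantization step (scale / 2).
print("max abs error:", np.abs(w - w_hat).max())
```

Per-channel scales (one scale per output row) usually recover more accuracy than the single per-tensor scale shown here, at a small bookkeeping cost; PTQ methods such as GPTQ and AWQ refine this basic idea further.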
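Unstructured magnitude pruning, as described above, can be shown in a few lines: the smallest-magnitude weights are zeroed while the tensor shape stays unchanged. This is a hedged sketch; `magnitude_prune` and the tie-breaking behavior are illustrative assumptions, not a particular framework's pruning API.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights; shape is unchanged."""
    k = int(w.size * sparsity)                   # number of weights to remove
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]  # k-th smallest magnitude
    return np.where(np.abs(w) <= thresh, 0.0, w)

w = np.random.randn(128, 128)
p = magnitude_prune(w, sparsity=0.9)
print("sparsity:", (p == 0).mean())              # ~0.9
```

Note the glossary's caveat in action: `p` has the same dense shape as `w`, so the zeros save no compute unless the runtime exploits sparsity. Structured pruning instead deletes whole rows, filters, or heads, shrinking the tensor itself.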
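The knowledge-distillation entry can be made concrete with the standard soft-target loss: a temperature-softened KL divergence between teacher and student output distributions. This is a minimal NumPy sketch assuming logit-level distillation with temperature T; the function names and example logits are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)               # soft targets from the teacher
    q = softmax(student_logits, T)
    # T^2 factor keeps gradient magnitudes comparable across temperatures.
    return (p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T**2

teacher = np.array([[5.0, 1.0, -2.0]])
student_good = np.array([[4.0, 0.5, -1.5]])      # roughly matches the teacher
student_bad = np.array([[-2.0, 1.0, 5.0]])       # ranking reversed
print(distillation_loss(student_good, teacher) < distillation_loss(student_bad, teacher))  # True
```

In practice this soft-target term is combined with the ordinary cross-entropy on hard labels, weighted by a mixing coefficient.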
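The LoRA entry's idea of decomposing a weight update into low-rank matrices can be sketched as follows: the frozen weight W is left untouched, and a rank-r update B·A is added at forward time. Dimensions, the `alpha` scaling convention, and the zero-initialization of B are assumptions chosen to mirror common LoRA setups, not a specific library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8

W = rng.standard_normal((d_out, d_in))           # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01        # trainable low-rank factor
B = np.zeros((d_out, r))                         # zero init: update starts as a no-op

def lora_forward(x, alpha=16):
    """y = x (W + (alpha/r) B A)^T : frozen layer plus low-rank correction."""
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
# With B = 0, the adapted layer matches the frozen layer exactly.
print(np.allclose(lora_forward(x), x @ W.T))     # True
```

The compression angle: a full update to W would have d_out × d_in = 262,144 parameters, while A and B together have r × (d_in + d_out) = 8,192, and the learned update can be merged into W after training at no inference cost.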