Optimization Algorithms
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Optimization algorithms in machine learning are the mathematical procedures that adjust model parameters to minimize a loss function. Every machine learning model is trained by an optimizer iterating over data, computing gradients, and updating weights. The choice of optimizer — and its hyperparameters — profoundly affects training speed, final performance, and stability. From classic stochastic gradient descent to adaptive methods like Adam and cutting-edge approaches for distributed training and LLMs, optimization is at the heart of all modern AI.
Remembering
- Loss function — A function measuring the discrepancy between model predictions and true labels; the objective to minimize.
- Gradient — The vector of partial derivatives of the loss with respect to all model parameters; points in the direction of steepest ascent.
- Gradient descent — Iteratively moving parameters in the negative gradient direction to minimize the loss.
- Stochastic Gradient Descent (SGD) — Gradient descent using the gradient of a single random example (or mini-batch) per step.
- Mini-batch SGD — Using a small batch of examples to estimate the gradient; the standard training approach.
- Learning rate — The step size in gradient descent; too large causes divergence, too small causes slow convergence.
- Momentum — An acceleration technique that accumulates a velocity vector in the gradient direction, dampening oscillations.
- Adam (Adaptive Moment Estimation) — An optimizer combining momentum with adaptive per-parameter learning rates; the default choice for most deep learning.
- AdamW — Adam with decoupled weight decay regularization; standard for training transformers.
- Learning rate schedule — Varying the learning rate during training: warmup, cosine decay, step decay.
- Warmup — Gradually increasing learning rate from near-zero at the start of training to prevent instability.
- Weight decay (L2 regularization) — Adding a penalty proportional to the sum of squared weights, preventing overfitting.
- Gradient clipping — Capping gradient magnitude to prevent exploding gradients, especially in RNNs and transformers.
- Batch size — The number of examples per gradient update; affects gradient variance, memory, and training dynamics.
- Learning rate finder — A technique for selecting a good learning rate by increasing it gradually and monitoring loss.
Understanding
Every ML model is trained by minimizing a loss function L(θ) over parameters θ. The loss surface is a high-dimensional landscape; optimization is the search for a good minimum.
Gradient descent takes steps proportional to the negative gradient: θ_{t+1} = θ_t − α ∇L(θ_t). The challenge: computing the exact gradient over all training data is expensive. Mini-batch SGD approximates it with a small sample — cheap but noisy.
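As a concrete instance of the update rule above, the following sketch runs mini-batch SGD from scratch on a synthetic linear-regression problem; the data, learning rate, and batch size are illustrative assumptions, not values from the article.
<syntaxhighlight lang="python">
import numpy as np

# Synthetic linear-regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(10)      # parameters theta
alpha = 0.1           # learning rate
batch_size = 32

for step in range(500):
    # Sample a mini-batch and estimate the gradient of the mean squared loss on it
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size   # noisy estimate of grad L(theta_t)
    w -= alpha * grad                              # theta_{t+1} = theta_t - alpha * grad
</syntaxhighlight>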
Why SGD noise helps: Counterintuitively, the noise from stochastic gradients often improves generalization: it helps the optimizer escape sharp local minima, which tend to generalize poorly, and settle into wider, flatter minima that generalize better (the sharp vs. flat minima hypothesis).
Adaptive optimizers (Adam, RMSProp): Different parameters often need different learning rates. Adam maintains exponential moving averages of each parameter's gradient (first moment) and squared gradient (second moment), and divides the step by the square root of the second moment: parameters with consistently large gradients receive smaller effective learning rates, while parameters with small gradients receive larger ones. This accelerates training on the heterogeneous loss landscapes of deep networks.
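A minimal sketch of the Adam update just described, written out for a single parameter array; the hyperparameters are the common defaults and the toy quadratic loss is an assumption for illustration.
<syntaxhighlight lang="python">
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # second moment: per-parameter gradient scale
    m_hat = m / (1 - beta1**t)                # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive per-parameter step
    return param, m, v

# Toy usage on L(p) = ||p||^2, whose gradient is 2p
p, m, v = np.ones(4), np.zeros(4), np.zeros(4)
for t in range(1, 101):
    p, m, v = adam_step(p, 2 * p, m, v, t)
</syntaxhighlight>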
The Adam vs. SGD debate: Adam converges faster and requires less tuning. SGD with momentum + careful LR schedule often finds slightly better final solutions (used for ImageNet training). For transformers and LLMs, AdamW is the universal default.
Batch size considerations: Large batch training is efficient (GPU utilization) but changes gradient dynamics. Linear scaling rule: if you double batch size, double learning rate. Large batch training tends toward sharp minima and may generalize worse — addressed by linear warmup and LARS/LAMB optimizers for very large batches.
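A small sketch of the linear scaling rule just described; the reference recipe (batch size 256 at learning rate 0.1) is an illustrative assumption.
<syntaxhighlight lang="python">
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: the learning rate grows proportionally with batch size."""
    return base_lr * batch / base_batch

# Reference recipe tuned at batch 256 with lr 0.1 (illustrative values)
print(scaled_lr(0.1, 256, 512))    # -> 0.2 when the batch size doubles
print(scaled_lr(0.1, 256, 1024))   # -> 0.4 for a 4x larger batch
</syntaxhighlight>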
Applying
Learning rate schedule with warmup and cosine decay:
<syntaxhighlight lang="python">
import math
import torch
import torch.nn as nn

class WarmupCosineSchedule(torch.optim.lr_scheduler._LRScheduler):
    def __init__(self, optimizer, warmup_steps, total_steps, eta_min=0.0, last_epoch=-1):
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.eta_min = eta_min
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        step = self.last_epoch
        if step < self.warmup_steps:
            # Linear warmup
            return [base_lr * step / max(1, self.warmup_steps) for base_lr in self.base_lrs]
        # Cosine decay
        progress = (step - self.warmup_steps) / max(1, self.total_steps - self.warmup_steps)
        return [self.eta_min + (base_lr - self.eta_min) * 0.5 * (1 + math.cos(math.pi * progress))
                for base_lr in self.base_lrs]

# Standard transformer training recipe
model = nn.TransformerEncoder(...)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),   # Momentum coefficients
    eps=1e-8,
    weight_decay=0.1,     # Decoupled L2 regularization
)
scheduler = WarmupCosineSchedule(optimizer, warmup_steps=1000, total_steps=100000)

# Training step
for batch in dataloader:
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
</syntaxhighlight>
Optimizer selection guide:
- Standard deep learning → AdamW (default for transformers, NLP, vision)
- ResNet / CNN training → SGD + momentum (0.9) + cosine decay (the ImageNet recipe; sketched after this list)
- Very large batch → LARS (vision) or LAMB (NLP) for stable large-batch training
- RNN / LSTM → Adam + gradient clipping (max_norm=5.0)
- Fast research iteration → Adam with liberal LR (1e-3), no schedule initially
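For the ResNet / CNN entry above, a minimal sketch of the SGD + momentum + cosine-decay recipe; the model choice, `train_loader`, and the 90-epoch horizon are illustrative assumptions, with hyperparameters set to commonly cited ImageNet defaults rather than values from this article.
<syntaxhighlight lang="python">
import torch
import torchvision

# Illustrative ResNet-50 setup; lr=0.1, momentum=0.9, weight_decay=1e-4 are common ImageNet defaults
model = torchvision.models.resnet50()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)  # decay over 90 epochs

for epoch in range(90):
    for images, labels in train_loader:      # train_loader is assumed to exist
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                          # step the cosine schedule once per epoch
</syntaxhighlight>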
Analyzing
Optimizer Comparison:
| Optimizer | Convergence Speed | Tuning Needed | Best Use Case |
|---|---|---|---|
| SGD + momentum | Slow | High (LR, momentum) | ResNet ImageNet training |
| Adam | Fast | Low | Most deep learning |
| AdamW | Fast | Low | Transformers, LLMs |
| RMSProp | Fast | Moderate | RNNs, RL |
| LARS | Fast (large batch) | Moderate | Large-batch vision |
| Lion | Fast, memory efficient | Moderate | LLM training (experimental) |
Failure modes: Learning rate too high → loss diverges (NaN). Too low → training stalls. Forgetting warmup → early instability from large, noisy gradients at random initialization. Exploding gradients in deep/recurrent networks without clipping. Adam's coupling of L2 regularization with the adaptive scaling, so the penalty does not act as true weight decay (fixed by AdamW's decoupled decay; always use AdamW, not Adam, for transformers). The sketch below contrasts the two updates.
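A minimal sketch contrasting Adam's coupled L2 penalty with AdamW's decoupled weight decay on a single parameter array; the hyperparameters are illustrative defaults.
<syntaxhighlight lang="python">
import numpy as np

def adam_l2_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """Adam with coupled L2: the decay term enters the gradient and is rescaled adaptively."""
    g = g + wd * p                                   # L2 penalty folded into the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return p - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """AdamW: decay is applied directly to the weights, outside the adaptive scaling."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p), m, v
</syntaxhighlight>
Because the coupled penalty is divided by the adaptive denominator, parameters with large second-moment estimates are decayed less, so the regularization is applied unevenly; decoupling restores a uniform pull toward zero, which is the motivation for preferring AdamW.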
Evaluating
Optimization evaluation:
- Loss curves: plot train and validation loss per step. Smooth decrease in both = healthy training. Divergence (NaN), oscillation (LR too high), or stagnation (LR too low) indicate problems.
- Learning rate sensitivity: train with 3 LRs (10× apart); a good optimizer should work across a range.
- Gradient norms: monitor the gradient norm per step; sudden spikes precede divergence (a minimal monitoring sketch follows this list).
- Wall-clock time to target accuracy: the relevant production metric — convergence speed matters.
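A minimal monitoring sketch for point 3, assuming `model`, `optimizer`, and `dataloader` already exist; the divergence thresholds and the print-based alert are illustrative.
<syntaxhighlight lang="python">
import torch

for step, batch in enumerate(dataloader):
    loss = model(batch)
    loss.backward()
    # clip_grad_norm_ returns the total norm computed *before* clipping, so it doubles as a monitor
    grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0))
    if not torch.isfinite(loss) or grad_norm > 100.0:   # thresholds are illustrative
        print(f"step {step}: loss={loss.item():.4f}, grad_norm={grad_norm:.2f}, possible divergence")
    optimizer.step()
    optimizer.zero_grad()
</syntaxhighlight>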
Creating
Designing an optimization strategy for a new model:
- Default: AdamW, lr=3e-4 (or 1e-3 for smaller models), betas=(0.9, 0.999), weight_decay=0.01–0.1.
- Schedule: linear warmup (5–10% of total steps) + cosine decay to 0.
- Clip gradients at max_norm=1.0 for stability.
- Batch size: start with the maximum that fits in GPU memory; scale LR linearly if changing batch size.
- Learning rate finder: use pytorch-lightning's LR finder to get a good initial estimate (a framework-agnostic range-test sketch follows this list).
- Monitor: W&B or TensorBoard; alert if loss becomes NaN or gradient norm explodes.
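If pytorch-lightning is not in the stack, a minimal, framework-agnostic LR range test can look like the sketch below; the bounds, step count, and the assumption that the model returns its own loss for a batch are illustrative, and the dataloader is assumed to have at least `num_steps` batches.
<syntaxhighlight lang="python">
import torch

def lr_range_test(model, dataloader, lr_min=1e-7, lr_max=1.0, num_steps=100):
    """Sweep the learning rate exponentially over a short run and record (lr, loss) pairs.
    A common heuristic: pick a starting LR roughly 10x below where the loss blows up."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1 / num_steps)     # multiplicative LR increase per step
    history = []
    data_iter = iter(dataloader)
    for step in range(num_steps):
        lr = lr_min * gamma ** step
        for group in optimizer.param_groups:         # set the LR for this step
            group["lr"] = lr
        batch = next(data_iter)
        loss = model(batch)                          # model assumed to return its loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        if not torch.isfinite(loss):                 # stop once the loss diverges
            break
    return history
</syntaxhighlight>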