Scaling Laws
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Scaling laws describe the empirical relationships between the computational resources invested in training AI models — compute, parameters, and data — and the resulting model performance. These power-law relationships, discovered by OpenAI (Kaplan et al., 2020) and refined by DeepMind (Hoffmann et al., 2022 — "Chinchilla"), provide a principled framework for predicting model capabilities before training and allocating compute budgets optimally. Scaling laws also revealed the phenomenon of emergent abilities — capabilities that appear suddenly at certain scale thresholds — reshaping how the field thinks about AI development.
Remembering
- Scaling law — A power-law relationship between model performance and compute, parameters, or data.
- Compute (C) — Total floating point operations used in training, measured in FLOPs.
- Parameters (N) — The total number of learnable weights in the model.
- Tokens (D) — The number of training data tokens the model was trained on.
- Cross-entropy loss — The primary metric in scaling law studies; lower = better language modeling.
- Kaplan scaling laws — OpenAI's 2020 paper showing loss scales as a power law with N, D, C.
- Chinchilla scaling laws — DeepMind's 2022 finding that, for a fixed compute budget, parameters and training tokens should be scaled in equal proportion (N ∝ D, roughly 20 tokens per parameter).
- Compute-optimal model — A model trained with the optimal N and D for a given compute budget, per Chinchilla.
- Emergent ability — A capability that appears only at certain model scale, not visible in smaller models.
- Phase transition — Abrupt, discontinuous improvement in a capability as scale increases.
- Irreducible loss — The minimum achievable loss on the data distribution; sets a floor on scaling improvements.
- Inference scaling — Using additional compute at inference time (more tokens of reasoning, chain-of-thought, search) to improve outputs.
- Test-time compute — Compute spent during inference to improve answer quality (e.g., best-of-N sampling, process reward models).
Understanding
Kaplan et al. (2020) found that language model loss L follows power laws:
- L(N) ∝ N^(-α) — larger models achieve lower loss
- L(D) ∝ D^(-β) — more training data achieves lower loss
- L(C) ∝ C^(-γ) — more compute achieves lower loss
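As a rough illustration, the parameter-scaling fit can be evaluated directly. The constants below (N_c ≈ 8.8 × 10^13, α ≈ 0.076) are the approximate values reported by Kaplan et al.; treat this as a sketch of the functional form, not a precise predictor:

```python
# Kaplan et al. (2020) parameter scaling fit: L(N) = (N_c / N) ** alpha
# N_C and ALPHA are the paper's approximate fitted constants.
N_C = 8.8e13
ALPHA = 0.076

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) from parameter count alone."""
    return (N_C / n_params) ** ALPHA

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}: predicted loss ≈ {loss_from_params(n):.3f}")
```

Note how each 10× increase in parameters buys a smaller absolute loss reduction — the signature of a power law with a small exponent.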
The Kaplan paper suggested: given a compute budget C, put most of it into a large N with modest D. This led to large, undertrained models (GPT-3: 175B parameters, only 300B tokens).
Chinchilla's revision: Hoffmann et al. (2022) ran more careful experiments and found the optimal scaling is N ∝ D — equal scaling of parameters and tokens. Chinchilla (70B parameters, 1.4T tokens) matched GPT-3 (175B, 300B tokens) despite being 2.5× smaller — because GPT-3 was severely undertrained. The implication: many 2020-era models were dramatically suboptimal; compute was wasted on parameters that could have been used for more training data.
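The C ≈ 6·N·D approximation makes it easy to check that GPT-3 and Chinchilla used training compute of the same order of magnitude (within about 2×) despite their very different shapes:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard training-compute approximation: C ≈ 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

gpt3 = train_flops(175e9, 300e9)        # 175B params, 300B tokens
chinchilla = train_flops(70e9, 1.4e12)  # 70B params, 1.4T tokens

print(f"GPT-3:      {gpt3:.2e} FLOPs")       # ≈ 3.15e+23
print(f"Chinchilla: {chinchilla:.2e} FLOPs") # ≈ 5.88e+23
```

Chinchilla actually spent somewhat more compute, but allocated it very differently: 2.5× fewer parameters, 4.7× more tokens.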
Emergent abilities: Some capabilities appear to jump discontinuously at certain scales — not a smooth improvement but a phase transition. Examples: chain-of-thought reasoning, multi-step arithmetic, code generation, complex analogy. These are not predicted by extrapolating smaller model behavior and remain partially unexplained. Some researchers argue "emergence" is an artifact of metric choices (continuous metrics show smooth improvement; pass/fail metrics show apparent jumps).
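The metric-artifact argument can be made concrete with a toy model (not a fit to real data): if per-token accuracy p improves smoothly with scale, the probability of getting an entire k-token answer exactly right is p^k, which looks like an abrupt jump on a pass/fail metric:

```python
# Toy model: per-token accuracy improves smoothly with scale, but
# exact-match on a 10-token answer (p ** k) appears to "emerge" abruptly.
k = 10
for p in (0.5, 0.7, 0.9, 0.95, 0.99):
    print(f"per-token acc {p:.2f} -> exact-match {p ** k:.4f}")
```

Going from 0.9 to 0.99 per-token accuracy takes exact-match from ~35% to ~90% — a smooth underlying improvement that reads as a phase transition on the discrete metric.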
Inference scaling (the new frontier): After training-time scaling laws, researchers discovered that more compute at *inference* time also improves quality — via best-of-N sampling, process reward models (PRMs), and extended chain-of-thought reasoning. Models like OpenAI o1 and DeepSeek-R1 leverage inference-time scaling for dramatically improved reasoning.
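A minimal sketch of best-of-N sampling, the simplest form of test-time compute. Here `generate` and `score` are hypothetical stand-ins for a model's stochastic sampler and a verifier/reward model, not real APIs:

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Hypothetical sampler: stands in for a stochastic model call."""
    rng = random.Random(seed)
    return f"{prompt} -> candidate {rng.randint(0, 999)}"

def score(candidate: str) -> float:
    """Hypothetical verifier/reward model: higher is better."""
    return float(sum(candidate.encode()) % 1000)

def best_of_n(prompt: str, n: int) -> str:
    """Spend n model calls at inference time; keep the highest-scoring one."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score)

print(best_of_n("What is 17 * 24?", n=8))
```

The design point: quality scales with n at the cost of n× inference compute, with no retraining — the trade-off that o1-style models push much further via learned chain-of-thought.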
Applying
Estimating optimal model size and data for a compute budget:
<syntaxhighlight lang="python">
import math

def chinchilla_optimal(compute_budget_flops: float) -> dict:
    """
    Estimate compute-optimal N (parameters) and D (tokens) per Chinchilla.

    Training FLOP approximation: C ≈ 6 * N * D.
    Chinchilla finds N and D should scale in equal proportion (N ∝ D),
    with a fitted rule of thumb of roughly D ≈ 20 * N tokens per parameter.
    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2,
    so N = sqrt(C / 120) and D = 20 * N.
    """
    n_opt = math.sqrt(compute_budget_flops / 120)
    d_opt = 20 * n_opt
    return {
        "optimal_parameters": f"{n_opt / 1e9:.1f}B",
        "optimal_tokens": f"{d_opt / 1e9:.1f}B",
        "compute_budget": f"{compute_budget_flops / 1e24:.2f} × 10^24 FLOPs",
    }

# Example: a budget of 10^23 FLOPs
result = chinchilla_optimal(1e23)
print(result)
# → {'optimal_parameters': '28.9B', 'optimal_tokens': '577.4B', ...}
</syntaxhighlight>
- Sanity check: Chinchilla's own budget (6 × 70e9 × 1.4e12 ≈ 5.9 × 10^23 FLOPs) recovers 70B parameters and 1.4T tokens.
- Real models use far more data: Llama 3 (8B) trained on 15T tokens, roughly 90× the Chinchilla-optimal ~160B tokens for that size.
- Post-Chinchilla practice: training well past the compute-optimal token count still pays off, because a smaller, overtrained model is cheaper to serve at inference.
- Scaling law context for major models
- GPT-3 (175B, 300B tokens) → Severely undertrained per Chinchilla
- Chinchilla (70B, 1.4T tokens) → First compute-optimal model; matched GPT-3
- Llama 2 (70B, 2T tokens) → Inference-aware: overtrained past Chinchilla-optimal to reduce serving cost
- Llama 3 (8B, 15T tokens) → Heavily overtrained; optimized for inference budget
- GPT-4 / Gemini Ultra → Undisclosed; outside estimates suggest 1T+ parameters and multi-epoch training
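The list above can be quantified with a simple "overtraining ratio": tokens actually used divided by the Chinchilla rule-of-thumb of roughly 20 tokens per parameter (figures are the publicly quoted ones from the list):

```python
# (parameters, training tokens) for models with public figures
MODELS = {
    "GPT-3":      (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 2":    (70e9, 2e12),
    "Llama 3":    (8e9, 15e12),
}

for name, (n, d) in MODELS.items():
    ratio = d / (20 * n)  # 1.0 ≈ compute-optimal; >1 overtrained; <1 undertrained
    print(f"{name:11s} D/(20N) = {ratio:6.2f}")
```

GPT-3 lands well below 1 (undertrained), Chinchilla at exactly 1 by construction, and Llama 3 (8B) around 94× — showing how far inference-driven overtraining has moved past compute-optimal.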
Analyzing
| Dimension | Effect on Loss | Diminishing Returns | Practical Limit |
|---|---|---|---|
| Parameters (N) | Power-law improvement | Yes (α ≈ 0.076) | GPU memory |
| Training tokens (D) | Power-law improvement | Yes (β ≈ 0.095) | Data availability / quality |
| Training compute (C) | Power-law improvement | Yes (γ ≈ 0.050) | Cost |
| Inference compute | Improves reasoning | Yes | Latency budget |
| Context length | Enables new tasks | Task-dependent | Quadratic attention cost |
Failure modes: Benchmark saturation — as models improve, existing benchmarks hit their ceilings, making further progress hard to measure. Emergent abilities may not appear in specific domains even at high scale (emergence thresholds differ by domain). Data quality is not captured in scaling laws — adding more low-quality tokens can hurt. Scaling laws derived on language models may not transfer to other modalities.
Evaluating
Scaling law evaluation: (1) Plot loss vs. compute on log-log scale — power-law fit implies predictable scaling. (2) Measure downstream task performance vs. scale — correlates with loss but not perfectly. (3) Test emergent capabilities at multiple scale checkpoints — identify when capabilities appear. (4) Compute efficiency: FLOPs used vs. final quality; compare to Chinchilla optimal as baseline. Expert practitioners evaluate their models against the scaling law predictions and investigate deviations.
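Step (1) — checking for a power law — reduces to a linear fit in log-log space. A sketch with synthetic loss data generated from a known power law (`numpy` assumed available):

```python
import numpy as np

# Synthetic (compute, loss) points drawn from a known power law
# L(C) = 2.0 * C ** -0.05, used here only to demonstrate the fitting recipe.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = 2.0 * compute ** -0.05

# A power law L = a * C**(-g) is a straight line in log-log space:
# log L = log a - g * log C, so fit a degree-1 polynomial to the logs.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"fitted exponent g ≈ {-slope:.3f}")              # recovers 0.05
print(f"fitted prefactor a ≈ {np.exp(intercept):.3f}")  # recovers 2.0
```

With real checkpoints, systematic curvature in the log-log plot (rather than a straight line) is the signal that the power law is breaking down — for example, near the irreducible loss floor.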
Creating
Using scaling laws for model development decisions: (1) Estimate compute budget (dollars → GPU-hours → FLOPs). (2) Use Chinchilla formula to find compute-optimal N and D. (3) Adjust for inference: if deploying many queries, overtrain smaller model (more D, less N) for lower inference cost. (4) Run scaling experiments at 1/100 compute to validate law holds for your data/architecture. (5) Use loss as the leading indicator; downstream task improvements follow with some delay.
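Step (1) of the workflow above can be sketched end to end. The GPU rental price, peak throughput (A100 bf16 dense peak, 312 TFLOPS), and utilization below are illustrative assumptions, not quotes:

```python
def budget_to_flops(dollars: float,
                    dollars_per_gpu_hour: float = 2.0,   # assumed rental price
                    peak_flops_per_gpu: float = 312e12,  # A100 bf16 dense peak
                    mfu: float = 0.40) -> float:
    """Convert a dollar budget into effective training FLOPs.

    MFU (model FLOPs utilization) discounts peak throughput to what
    training actually sustains; 30-50% is a common real-world range.
    """
    gpu_hours = dollars / dollars_per_gpu_hour
    return gpu_hours * 3600 * peak_flops_per_gpu * mfu

flops = budget_to_flops(1_000_000)  # a $1M training run
print(f"{flops:.2e} effective FLOPs")  # ≈ 2.25e+23
```

The resulting FLOP count feeds directly into step (2), the Chinchilla sizing calculation from the Applying section.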