Scaling Laws and Emergent Abilities
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Scaling laws describe the empirical relationships between the computational resources invested in training AI models (compute, parameters, and data) and the resulting model performance. These power-law relationships, discovered by OpenAI (Kaplan et al., 2020) and refined by DeepMind (Hoffmann et al., 2022, "Chinchilla"), provide a principled framework for predicting model capabilities before training and for allocating compute budgets optimally. Scaling laws also revealed the phenomenon of emergent abilities: capabilities that appear suddenly at certain scale thresholds, reshaping how the field thinks about AI development.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Scaling law''' – A power-law relationship between model performance and compute, parameters, or data.
* '''Compute (C)''' – Total floating-point operations used in training, measured in FLOPs.
* '''Parameters (N)''' – The total number of learnable weights in the model.
* '''Tokens (D)''' – The number of training-data tokens the model was trained on.
* '''Cross-entropy loss''' – The primary metric in scaling-law studies; lower means better language modeling.
* '''Kaplan scaling laws''' – OpenAI's 2020 paper showing loss scales as a power law with N, D, and C.
* '''Chinchilla scaling laws''' – DeepMind's 2022 finding that compute-optimal training scales model size and training data equally (N ∝ D).
* '''Compute-optimal model''' – A model trained with the optimal N and D for a given compute budget, per Chinchilla.
* '''Emergent ability''' – A capability that appears only at a certain model scale and is not visible in smaller models.
* '''Phase transition''' – An abrupt, discontinuous improvement in a capability as scale increases.
* '''Irreducible loss''' – The minimum achievable loss on the data distribution; sets a floor on scaling improvements.
* '''Inference scaling''' – Using additional compute at inference time (more tokens of reasoning, chain-of-thought, search) to improve outputs.
* '''Test-time compute''' – Compute spent during inference to improve answer quality (e.g., best-of-N sampling, process reward models).
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Kaplan et al. (2020) found that language-model loss L follows power laws:
* L(N) ∝ N^(−α) – larger models achieve lower loss
* L(D) ∝ D^(−β) – more training data achieves lower loss
* L(C) ∝ C^(−γ) – more compute achieves lower loss

The Kaplan paper suggested that, given a compute budget C, most of it should go into a large N with modest D. This led to large, undertrained models (GPT-3: 175B parameters, only 300B tokens).

'''Chinchilla's revision''': Hoffmann et al. (2022) ran more careful experiments and found that optimal scaling grows parameters and tokens equally (N ∝ D). Chinchilla (70B parameters, 1.4T tokens) matched GPT-3 (175B, 300B tokens) despite being 2.5× smaller, because GPT-3 was severely undertrained. The implication: many 2020-era models were dramatically suboptimal; compute spent on extra parameters would have been better spent on more training data.

'''Emergent abilities''': Some capabilities appear to jump discontinuously at certain scales – not a smooth improvement but a phase transition. Examples include chain-of-thought reasoning, multi-step arithmetic, code generation, and complex analogy. These are not predicted by extrapolating smaller-model behavior and remain partially unexplained. Some researchers argue "emergence" is an artifact of metric choice: continuous metrics show smooth improvement, while pass/fail metrics show apparent jumps.
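One way to make these power laws concrete is the joint parametric loss form fitted in the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β. The sketch below uses the fitted constants reported by Hoffmann et al. (2022); treat the exact numbers as approximate:

```python
def chinchilla_loss(n_params: float, d_tokens: float) -> float:
    """Predicted pretraining loss from the Chinchilla parametric fit:
    L(N, D) = E + A / N^alpha + B / D^beta
    Constants are the fitted values reported in Hoffmann et al. (2022);
    E is the irreducible loss of the data distribution.
    """
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params**alpha + B / d_tokens**beta

# Chinchilla's own allocation: 70B parameters, 1.4T tokens
print(f"{chinchilla_loss(70e9, 1.4e12):.3f}")   # ≈ 1.94
# GPT-3-style allocation: 175B parameters, only 300B tokens
print(f"{chinchilla_loss(175e9, 300e9):.3f}")   # ≈ 2.00 (higher, despite more parameters)
```

The fit makes the undertraining argument quantitative: the data term B/D^β dominates GPT-3's predicted loss, so shifting compute from parameters to tokens lowers it.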
'''Inference scaling (the new frontier)''': After training-time scaling laws, researchers found that additional compute at ''inference'' time also improves quality, via best-of-N sampling, process reward models (PRMs), and extended chain-of-thought reasoning. Models such as OpenAI o1 and DeepSeek-R1 leverage inference-time scaling for dramatically improved reasoning.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Estimating optimal model size and data for a compute budget:'''
<syntaxhighlight lang="python">
import math

def chinchilla_optimal(compute_budget_flops: float) -> dict:
    """
    Estimate compute-optimal N (parameters) and D (tokens) per Chinchilla.

    Training FLOP approximation: C ≈ 6 * N * D
    Chinchilla rule of thumb:    D ≈ 20 * N (tokens ≈ 20× parameters)
    Substituting: C = 6 * N * (20 * N) = 120 * N^2
    So N_opt = sqrt(C / 120) and D_opt = 20 * N_opt.
    """
    n_opt = math.sqrt(compute_budget_flops / 120)
    d_opt = 20 * n_opt
    return {
        "optimal_parameters": f"{n_opt / 1e9:.1f}B",
        "optimal_tokens": f"{d_opt / 1e9:.1f}B",
        "compute_budget": f"{compute_budget_flops:.1e} FLOPs",
    }

# Example: budget of 10^23 FLOPs
result = chinchilla_optimal(1e23)
print(result)
# → {'optimal_parameters': '28.9B', 'optimal_tokens': '577.4B', ...}

# Sanity check: Chinchilla itself (70B, 1.4T tokens) corresponds to
# C ≈ 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs.
# Real models often use far more data: Llama 3 (8B) trained on 15T tokens,
# roughly 90× the ~160B "Chinchilla-optimal" count. Overtraining a small
# model trades extra training compute for cheaper inference.
</syntaxhighlight>

; Scaling-law context for major models
: '''GPT-3 (175B, 300B tokens)''' – Severely undertrained per Chinchilla
: '''Chinchilla (70B, 1.4T tokens)''' – First compute-optimal model; matched GPT-3
: '''Llama 2 (70B, 2T tokens)''' – Inference-optimal: overtrained for smaller deployment
: '''Llama 3 (8B, 15T tokens)''' – Heavily overtrained; optimized for inference budget
: '''GPT-4 / Gemini Ultra''' – Undisclosed; estimated at 1T+ parameters with multi-epoch training
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Scaling Dimensions and Their Effects
! Dimension !! Effect on Loss !! Diminishing Returns !! Practical Limit
|-
| Parameters (N) || Power-law improvement || Yes (α ≈ 0.076) || GPU memory
|-
| Training tokens (D) || Power-law improvement || Yes (β ≈ 0.095) || Data availability / quality
|-
| Training compute (C) || Power-law improvement || Yes (γ ≈ 0.050) || Cost
|-
| Inference compute || Improves reasoning || Yes || Latency budget
|-
| Context length || Enables new tasks || Task-dependent || Quadratic attention cost
|}

'''Failure modes''': As models improve, benchmarks saturate, making progress hard to measure. Emergent abilities may not appear for specific domains even at high scale (domain-specific emergence thresholds differ). Data quality is not captured in scaling laws; adding more low-quality tokens can hurt. Scaling laws derived on language models may not transfer to other modalities.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Scaling-law evaluation:
(1) Plot loss vs. compute on a log-log scale; a good power-law fit implies predictable scaling.
(2) Measure downstream task performance vs. scale; it correlates with loss, but not perfectly.
(3) Test emergent capabilities at multiple scale checkpoints to identify when capabilities appear.
(4) Compare compute efficiency (FLOPs used vs. final quality) against the Chinchilla-optimal baseline.
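Step (1) can be sketched as a least-squares fit in log-log space. The loss values below are synthetic, generated from a known power law so the recovered exponent can be checked; the exponent 0.05 is an illustrative choice near Kaplan's fitted value:

```python
import math

# Synthetic (compute, loss) observations following L = L_inf + 400 * C^(-0.05).
# In a real evaluation, these would be measured losses from training runs.
compute = [1e18, 1e19, 1e20, 1e21, 1e22]
loss = [2.0 + 400 * c**-0.05 for c in compute]

# Fit log(L - L_inf) = log(a) - gamma * log(C) by least squares,
# assuming the irreducible loss L_inf is known (here 2.0).
L_inf = 2.0
xs = [math.log(c) for c in compute]
ys = [math.log(l - L_inf) for l in loss]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
gamma = -slope
print(f"fitted exponent gamma ≈ {gamma:.3f}")  # recovers 0.050 on this synthetic data
```

A power law is a straight line on a log-log plot, so systematic curvature in the residuals signals that the fitted law will not extrapolate reliably.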
Expert practitioners evaluate their models against scaling-law predictions and investigate deviations.
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Using scaling laws for model-development decisions:
(1) Estimate the compute budget (dollars → GPU-hours → FLOPs).
(2) Use the Chinchilla formula to find compute-optimal N and D.
(3) Adjust for inference: if serving many queries, overtrain a smaller model (more D, less N) for lower inference cost.
(4) Run scaling experiments at ~1/100 of the compute to validate that the law holds for your data and architecture.
(5) Use loss as the leading indicator; downstream task improvements follow with some delay.

[[Category:Artificial Intelligence]]
[[Category:Large Language Models]]
[[Category:Deep Learning]]
</div>
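The budgeting steps in the Creating section can be sketched as a back-of-the-envelope calculation. The per-GPU throughput, utilization factor, and inference-traffic figures below are illustrative assumptions, not measurements; substitute your own hardware and deployment numbers:

```python
def training_flops(gpu_count: int, hours: float,
                   flops_per_gpu: float = 1e15, utilization: float = 0.4) -> float:
    """Step (1): GPU-hours → training FLOPs.
    1e15 FLOP/s peak and 40% utilization are placeholder assumptions."""
    return gpu_count * hours * 3600 * flops_per_gpu * utilization

def lifetime_flops(n_params: float, budget_flops: float,
                   inference_tokens: float) -> float:
    """Step (3): total FLOPs = training budget + inference cost,
    using ~2 * N FLOPs per generated token as the inference approximation."""
    return budget_flops + 2 * n_params * inference_tokens

budget = training_flops(gpu_count=1024, hours=720)  # one month on 1024 GPUs
# Compare a larger model against a smaller, overtrained one at an assumed
# deployment volume of 10^13 inference tokens.
for n in (30e9, 8e9):
    d = budget / (6 * n)  # tokens this N can be trained on within the budget
    total = lifetime_flops(n, budget, inference_tokens=1e13)
    print(f"N={n/1e9:.0f}B: D ≈ {d/1e12:.1f}T tokens, lifetime FLOPs ≈ {total:.2e}")
```

At high serving volume the smaller model wins on lifetime compute even though, per Chinchilla, it is well past its compute-optimal token count; this is the quantitative version of step (3).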