Updated 2026-03-12
AI Scaling Laws

Empirical relationships between compute, data, parameters, and AI performance


AI Scaling Laws describe empirical power-law relationships that govern how neural network performance improves with increases in model size, dataset size, and computational resources. These relationships have become foundational to modern AI development, enabling researchers to predict model capabilities and optimize resource allocation during training.

Overview

Scaling laws characterize how test loss decreases as key factors increase. The relationships follow predictable mathematical patterns across multiple orders of magnitude, allowing organizations to forecast performance improvements and make strategic decisions about model architecture and training approaches.

The discovery of these patterns in 2020 fundamentally changed AI development strategy, leading labs to prioritize scaling as a primary path to capability improvements. Dario Amodei and Ilya Sutskever have stated that the discovery of scaling laws led them to pursue the current large language model paradigm.1

Key Variables

Scaling laws examine the relationship between model performance (typically measured by test loss) and three primary factors:

  • Model size (N): Number of parameters in the neural network
  • Dataset size (D): Number of training tokens or examples
  • Compute budget (C): Total computational resources (FLOPs) used for training

Kaplan Scaling Laws (2020)

In January 2020, researchers at OpenAI published foundational work demonstrating that loss scales as a power-law with model size, dataset size, and training compute across more than seven orders of magnitude.2

Mathematical Formulation

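Kaplan et al. fit power laws in N and D both separately and jointly. A common simplified way to write the joint dependence, using the additive parametrization later adopted in the Chinchilla paper, is: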

L = A/N^α + B/D^β + L₀

Where:

  • L is the test loss
  • N is the number of model parameters
  • D is the dataset size (number of tokens)
  • α, β are scaling exponents
  • A, B, L₀ are constants

The research found that when not bottlenecked by the other factors, each variable exhibits power-law behavior independently.2

Key Findings

The 2020 scaling laws revealed several important patterns:2

  • Larger models are significantly more sample-efficient, achieving lower loss with fewer training tokens
  • For optimal compute-efficient training, models should be trained on relatively modest data and stopped before convergence
  • Performance depends strongly on scale but weakly on model shape (depth vs. width)
  • The critical batch size approximately doubles for every 13% decrease in loss
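The batch-size rule in the last bullet can be read as a power law between critical batch size and loss. The following is a minimal sketch, assuming the relationship is an exact power law of the form B_crit ∝ L^(−k); the function and variable names are illustrative and do not come from the paper.

```python
import math

# Assumption: B_crit is proportional to L**(-k) for some exponent k.
# "Doubles for every 13% decrease in loss" then means 0.87**(-k) == 2.
loss_ratio = 1.0 - 0.13  # loss after a 13% decrease, relative to before
k = math.log(2.0) / -math.log(loss_ratio)


def batch_size_multiplier(loss_before: float, loss_after: float) -> float:
    """Factor by which the critical batch size grows under B_crit ∝ L**(-k)."""
    return (loss_before / loss_after) ** k


print(f"implied exponent k = {k:.2f}")
print(f"multiplier for a 13% loss decrease = {batch_size_multiplier(1.0, 0.87):.2f}")
```

Under this assumption the implied exponent is close to 5, so even modest reductions in loss correspond to large increases in the batch size that can be used efficiently.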

Compute Allocation Strategy

The original Kaplan scaling laws suggested that as the pre-training compute budget increases, model size should scale faster than data. Specifically, with a 10x increase in training budget, the optimal strategy was to scale model size by 5.5x and data by 1.8x.3

This approach influenced GPT-3's development: the model has 175 billion parameters and was trained on approximately 300 billion tokens, a ratio of only about 1.7 tokens per parameter.3 The training required approximately 3.15 × 10²³ FLOPs.4
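The GPT-3 figures above can be checked with a few lines of arithmetic. This sketch assumes the widely used approximation C ≈ 6·N·D for dense transformer training compute, which is not stated in the text; the constant names are illustrative.

```python
# Figures quoted above for GPT-3.
N_PARAMS = 175e9   # model parameters
N_TOKENS = 300e9   # training tokens

tokens_per_param = N_TOKENS / N_PARAMS

# Assumption: training compute is roughly 6 FLOPs per parameter per token
# (forward plus backward pass), i.e. C ≈ 6 * N * D.
train_flops = 6 * N_PARAMS * N_TOKENS

print(f"tokens per parameter: {tokens_per_param:.2f}")
print(f"estimated training FLOPs: {train_flops:.3e}")
```

Under that approximation, the estimate agrees with the quoted 3.15 × 10²³ FLOPs.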

Chinchilla Scaling Laws (2022)

In March 2022, Google DeepMind researchers published revised scaling laws that fundamentally changed understanding of optimal training strategies.5

Revised Compute-Optimal Training

The Chinchilla paper trained over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens. The research found that for compute-optimal training, model size and training tokens should be scaled equally—for every doubling of model size, training tokens should also double.5

This contrasted sharply with the Kaplan scaling laws. Where Kaplan suggested scaling parameters with exponent a = 0.73 and data with exponent b = 0.27, Chinchilla analysis showed both exponents should be approximately 0.5.6
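To see what these exponents mean in practice, the sketch below converts them into scale-up factors for a 10x larger compute budget. It assumes pure power-law allocation rules N_opt ∝ C^a and D_opt ∝ C^b with the exponents quoted above; the helper name is illustrative.

```python
def scale_factors(budget_multiplier: float, param_exp: float, data_exp: float):
    """Return (model-size factor, data factor) for a given compute multiplier."""
    return budget_multiplier ** param_exp, budget_multiplier ** data_exp


kaplan_params, kaplan_data = scale_factors(10, 0.73, 0.27)
chinchilla_params, chinchilla_data = scale_factors(10, 0.5, 0.5)

print(f"Kaplan:     params x{kaplan_params:.2f}, data x{kaplan_data:.2f}")
print(f"Chinchilla: params x{chinchilla_params:.2f}, data x{chinchilla_data:.2f}")
```

The Kaplan exponents give roughly 5.4x more parameters and 1.9x more data, close to the 5.5x and 1.8x figures reported for the original laws, while the Chinchilla exponents give about 3.2x for each.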

Mathematical Relationship

Under Chinchilla scaling laws, both optimal parameters and tokens grow as the square root of compute:7

  • N_opt ∝ C^0.5
  • D_opt ∝ C^0.5

With the coefficients fitted in the paper, this corresponds to a ratio of approximately 20 tokens per parameter for compute-optimal training.5
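As a rough illustration of how the 20-tokens-per-parameter rule turns a compute budget into a model size and token count, the sketch below combines it with the common approximation C ≈ 6·N·D. Both the approximation and the example budget are assumptions made for illustration, not values taken from the paper's fits.

```python
import math

TOKENS_PER_PARAM = 20.0        # rule of thumb quoted above
FLOPS_PER_PARAM_TOKEN = 6.0    # assumption: C ≈ 6 * N * D


def compute_optimal(budget_flops: float):
    """Split a FLOP budget into (parameters, tokens) under D = 20 * N."""
    # C = 6 * N * D = 6 * 20 * N**2, so N = sqrt(C / 120).
    n_params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Illustrative budget, chosen so the result lands near a 70B-parameter model.
n_params, n_tokens = compute_optimal(5.88e23)
print(f"parameters: {n_params:.3e}, tokens: {n_tokens:.3e}")
```

Because both quantities grow as the square root of compute under this rule, quadrupling the budget doubles both the model size and the token count.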

Chinchilla Model Performance

DeepMind tested the revised scaling laws by training Chinchilla, a 70-billion-parameter model, on 1.4 trillion tokens (20 tokens per parameter). Despite using the same compute budget as the 280-billion-parameter Gopher model, Chinchilla achieved superior performance:5

  • 67.5% accuracy on MMLU benchmark (compared to Gopher's 60.0%)
  • Uniformly outperformed Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) across evaluated tasks

The research revealed that previous models like GPT-3 were significantly undertrained relative to compute-optimal standards.7

Implications for Training Strategy

The Chinchilla findings suggested that for Gopher's training budget, a 4x smaller model trained on 4x more data would have been preferable.8 This insight led the AI industry to reconsider the balance between model size and training duration.

Power Law Relationships

Mathematical Structure

Neural scaling laws are power laws, which are scale-invariant: multiplying the input by a constant c only rescales the output, f(cx) ∝ f(x). For example, if f(x) = a·x^(−k), then f(cx) = c^(−k)·f(x).9 The general form of the loss function incorporates terms for both model size and dataset size:9

L = A/N^α + B/D^β + L₀

Where scaling exponents typically fall in ranges:

  • α ∈ [0.3, 0.5]
  • β ≈ 0.6

For Chinchilla specifically, α ≈ 0.34 and β ≈ 0.28.9
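The exponents α and β determine how a fixed budget should be split. Minimizing the loss formula above under the constraint C ∝ N·D gives N_opt ∝ C^(β/(α+β)) and D_opt ∝ C^(α/(α+β)). The sketch below evaluates this using the Chinchilla exponents from the text. The constants A = 406.4, B = 410.7 and L₀ = 1.69 are recalled from the Chinchilla paper's parametric fit, do not appear in the text above, and should be treated as assumptions.

```python
def parametric_loss(n_params, n_tokens, A=406.4, B=410.7, L0=1.69,
                    alpha=0.34, beta=0.28):
    """L = A / N**alpha + B / D**beta + L0 (constants are assumed values)."""
    return A / n_params ** alpha + B / n_tokens ** beta + L0


def optimal_exponents(alpha, beta):
    """Exponents (a, b) in N_opt ∝ C**a and D_opt ∝ C**b when C ∝ N * D."""
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    return a, b


a, b = optimal_exponents(0.34, 0.28)
print(f"N_opt exponent a = {a:.2f}, D_opt exponent b = {b:.2f}")
print(f"predicted loss at 70B params, 1.4T tokens: {parametric_loss(70e9, 1.4e12):.2f}")
```

With these exponents the split is close to, but not exactly, the equal 0.5/0.5 scaling described earlier, which is why the equal-scaling rule is best read as an approximation.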

Four Scaling Regimes

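Recent theoretical work identifies variance-limited and resolution-limited behavior for both dataset size and model size, giving four distinct scaling regimes:10

  1. Variance-limited, dataset size: Scaling with the amount of data that follows from the existence of a well-behaved infinite-data limit
  2. Variance-limited, model size: Scaling with the number of parameters that follows from the existence of a well-behaved infinite-width limit
  3. Resolution-limited, dataset size: Scaling with the amount of data when the model is effectively resolving a smooth data manifold
  4. Resolution-limited, model size: Scaling with the number of parameters when the model is effectively resolving a smooth data manifold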

Understanding these regimes helps predict when scaling different factors will yield improvements.

Practical Applications

Compute Budget Allocation

Organizations use scaling laws to determine optimal resource allocation. For a given computational budget, the compute-optimal strategy divides resources approximately equally between model parameters and training tokens.7

This has practical implications for training decisions:

  • A model trained to 20 tokens per parameter is likely compute-optimal
  • Training beyond this ratio yields diminishing returns per FLOP
  • Undertraining (fewer tokens per parameter) wastes potential model capacity

Predictive Capability

Scaling laws enable prediction of model performance before expensive training runs. Researchers can:

  • Extrapolate performance from smaller pilot experiments
  • Estimate capabilities of future models given projected compute availability
  • Make informed decisions about architecture choices and training duration
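Extrapolating from pilot experiments usually amounts to fitting a power law to a handful of small runs and evaluating it at a larger scale. The sketch below shows that idea with a log-log least-squares fit. The pilot data are synthetic (generated from a known law rather than measured), and the fit ignores any irreducible loss term, so it illustrates the mechanics only.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-k) in log-log space; returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope


# Synthetic pilot runs: training FLOPs and the resulting (reducible) loss.
pilot_flops = [1e17, 1e18, 1e19, 1e20]
pilot_loss = [5.0 * c ** -0.05 for c in pilot_flops]  # made-up generating law

a, k = fit_power_law(pilot_flops, pilot_loss)
predicted = a * 1e23 ** -k  # extrapolate three orders of magnitude further
print(f"fitted a = {a:.3f}, k = {k:.3f}, predicted loss at 1e23 FLOPs = {predicted:.3f}")
```

On real measurements the fit would be noisy, and extrapolations over several orders of magnitude are only as reliable as the assumption that the same power law continues to hold.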

Anthropic used this predictive capability when scaling sparse autoencoders to production models, tuning methods at affordable scale before applying them to Claude 3 Sonnet.11

Model Design Decisions

Meta's development of Llama 3 revealed new observations on scaling behavior. Models trained on 15 trillion tokens continued improving log-linearly, demonstrating performance gains beyond Chinchilla-optimal training. The research also developed scaling laws for optimal data mix selection.12

Emergent Abilities and Scale

Definition and Examples

Wei et al. (2022) documented emergent abilities—capabilities not present in smaller models that appear in larger ones.13 These abilities manifest unpredictably at certain scale thresholds.

Key findings:

  • Chain of thought prompting only surpasses standard prompting when scaled to approximately 10²³ training FLOPs (roughly 100 billion parameters)
  • Specialized prompting or finetuning methods can be emergent, showing no positive effects until certain model scales
  • Abilities appear suddenly rather than gradually as model size increases

Measurement Controversy

Schaeffer et al. (2023) challenged the emergent abilities framework, arguing that apparent emergence results from metric choice rather than fundamental model behavior changes.14

Their analysis found:

  • Nonlinear or discontinuous metrics (like exact match accuracy) produce apparent emergent abilities
  • Linear or continuous metrics (like token edit distance) reveal smooth, predictable improvements
  • Of 29 examined metrics, 25 showed no emergent properties and revealed continuous, linear growth with scale

This suggests that emergence may partially reflect measurement artifacts rather than fundamental capability discontinuities.
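The metric argument can be illustrated with a toy calculation. The sketch below assumes that each token of an answer is correct independently with the same probability, and that exact match requires every token to be right; the per-token accuracies are made-up values standing in for models of increasing scale.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that all tokens are correct, assuming independent errors."""
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies for successively larger models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

for p in per_token:
    print(f"per-token accuracy {p:.2f} -> exact match on 10 tokens "
          f"{exact_match_rate(p, 10):.4f}")
```

The per-token metric rises steadily, while exact match stays near zero for most of the range and then climbs steeply, which is the pattern that can look like a sudden emergent ability.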

Limitations and Breakdown Points

Broken Power Laws

Caballero et al. (2022) proposed "smoothly broken power laws" to better model complex scaling behaviors.15 This approach:

  • Captures phenomena like double descent that traditional power laws cannot express
  • Models scaling as linear segments connected by smooth breaks on log-log plots
  • Yields extrapolations with RMSE 0.86 times that of traditional power law methods
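A smoothly broken power law can be sketched as an ordinary power law multiplied by one factor per break, where each factor changes the log-log slope around a break point. The functional form below is written from memory of the Broken Neural Scaling Laws paper and should be treated as an assumption rather than a verified transcription; the parameter values are arbitrary.

```python
import math


def broken_power_law(x, a, b, c0, breaks):
    """y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i).

    Each break is a tuple (c_i, d_i, f_i): slope change, break location,
    and sharpness of the transition.
    """
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y


def local_slope(x, **params):
    """Log-log slope of the curve between x and 10 * x."""
    y0 = broken_power_law(x, **params)
    y1 = broken_power_law(10 * x, **params)
    return math.log10(y1 / y0)


# One break at x = 1e6 where the slope steepens from -0.1 to -0.4.
params = dict(a=0.0, b=1.0, c0=0.1, breaks=[(0.3, 1e6, 0.5)])
slope_before = local_slope(1e2, **params)
slope_after = local_slope(1e10, **params)
print(f"slope well before the break: {slope_before:.3f}")
print(f"slope well after the break:  {slope_after:.3f}")
```

This example is monotone and uses only one break; capturing non-monotone behavior such as double descent would require additional breaks with slope changes of opposite sign.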

Double Descent

OpenAI documented the double descent phenomenon in 2019: as parameters increase, test error initially decreases, increases (overfitting), then undergoes a second descent.16 This contradicts classical assumptions about overfitting and reveals non-monotonic scaling behavior.

Hierarchy Loss

Bahri et al. (2021) observed breakdown of predicted scaling behavior when model parameters become very large and structural hierarchy is lost.17 This suggests fundamental limits may exist to simple power-law extrapolation.

Test-Time Compute Scaling

Inference-Time Reasoning

OpenAI's o1 model introduced a new scaling dimension: performance improvements from increased test-time compute. The model's accuracy on AIME 2024 demonstrates this effect:18

  • Single sample: 74% accuracy
  • Consensus among 64 samples: 83% accuracy
  • Re-ranking 1,000 samples: 93% accuracy

This represents a shift from improving model training to improving inference-time computation through extended reasoning processes.19
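The two multi-sample strategies in the list above, consensus and re-ranking, can be sketched in a few lines. This is a generic illustration rather than OpenAI's implementation; the sampled answers and the scoring table are made up, and a real system would use a learned scoring function.

```python
from collections import Counter


def majority_vote(answers):
    """Consensus: return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, score):
    """Re-ranking: return the answer with the highest score."""
    return max(answers, key=score)


# Hypothetical final answers from six independent samples of one problem.
samples = ["204", "210", "204", "198", "204", "210"]

# Hypothetical scores standing in for a learned scoring function.
hypothetical_scores = {"204": 0.4, "210": 0.9, "198": 0.1}

print("consensus answer:", majority_vote(samples))
print("re-ranked answer:", best_of_n(samples, hypothetical_scores.get))
```

The two strategies can disagree, as they do here: consensus favors the most common answer, while re-ranking can select a less frequent answer if the scorer rates it more highly. Both spend more inference compute in exchange for accuracy.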

Test-Time Compute Methods

Models like o1 dynamically increase reasoning time during inference, spending more time on complex questions. This improves accuracy at the cost of higher computational expense. Implementations use step-by-step chain-of-thought reasoning to explore solution spaces more thoroughly.20

Scaling Beyond Language Models

World Models and Agents

Recent research demonstrates that power laws extend beyond language modeling to world models and imitation learning.21 Performance of embodied agents improves with increases in model parameters, dataset size, and compute across domains from robotics to video games.

However, coefficients are heavily influenced by tokenizer choice, task characteristics, and architecture, limiting direct transfer of language model scaling laws to other domains.21

Vision and Multimodal Models

Scaling law principles apply to image classification, neural machine translation, and other modalities, though specific exponents and relationships differ.22 Research provides methods for estimating scaling law parameters reliably from learning curves across diverse architecture families.

Mixture of Experts

Scaling laws for Mixture-of-Experts (MoE) architectures differ from dense transformers. Recent work suggests:23

  • Optimal number of activated experts is approximately 7 for classical MoE settings
  • Contrary to earlier results, MoE architectures can be more efficient than dense transformers regardless of model size
  • Expert granularity significantly affects optimal configurations

Transfer Learning and Fine-tuning

Transfer Scaling Laws

Research on transfer learning reveals that pre-training data size and fine-tuning data size both influence downstream performance through distinct scaling relationships.24 The "transfer gap" measures the degree of transfer from pre-training to downstream distributions.

Measuring this gap from less expensive pre-training runs can predict improvements from more costly downstream training.24

Fine-tuning Dataset Size

The choice of pre-training data and its size affect downstream task performance. Distribution alignment between pre-training and downstream data significantly influences scaling behavior, with implications for how much fine-tuning data is needed to achieve target performance.25

Data Quality vs. Quantity

Compute-Dependent Filtering

Recent work challenges the assumption that data filtering decisions can be made independently of compute budget.26 Key findings:

  • When training with low compute budgets, data quality produces better results
  • At larger computing scales, diminishing utility of limited high-quality data makes quantity increasingly important
  • Optimal filtering strategies should account for the computational budget

Data Distribution Effects

Test error scaling depends on properties of the data distribution. Below a critical threshold, power-law distributions of subtask difficulties govern scaling. Above this threshold, a single dominant structure controls the relationship between error and scale.27

Inference Cost Scaling

Economic Implications

While training costs are incurred once, inference costs accumulate over a model's deployment lifetime. Epoch AI analysis indicates:28

  • Inference latency scales with the square root of model size and cube root of memory bandwidth
  • Cumulative inference costs can equal or exceed initial training costs for widely deployed models
  • Inference revenue at major AI companies grows at 3x per year or more

This means compute-optimal training must consider not just training efficiency but also inference cost implications of model size choices.
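A back-of-the-envelope calculation shows why deployment volume matters. The sketch below assumes the common approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token for a dense model. These constants are assumptions, not figures from the Epoch AI analysis, and the calculation ignores hardware utilization differences between training and inference.

```python
def breakeven_inference_tokens(n_params: float, n_train_tokens: float) -> float:
    """Tokens served at which cumulative inference FLOPs equal training FLOPs."""
    train_flops = 6 * n_params * n_train_tokens   # assumption: C ≈ 6 * N * D
    flops_per_generated_token = 2 * n_params      # assumption: forward pass only
    return train_flops / flops_per_generated_token


# Example: a 70B-parameter model trained on 1.4T tokens.
tokens = breakeven_inference_tokens(70e9, 1.4e12)
print(f"break-even after about {tokens:.2e} generated tokens")
```

Under these assumptions the break-even point is three times the number of training tokens regardless of model size, so a widely deployed model can pass it, which favors smaller models trained on more data than the compute-optimal ratio suggests.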

Organizational Research Programs

OpenAI

OpenAI's 2020 scaling laws research directly influenced GPT-3's development and established the foundation for subsequent large language model development.29 The organization continues researching test-time compute scaling with the o1 series.18

Google DeepMind

Google DeepMind's Chinchilla research (2022) revised understanding of compute-optimal training, showing that model parameters and training tokens should scale equally.5 This work analyzed over 400 models and demonstrated substantial efficiency gains.

Anthropic

Anthropic has applied scaling law principles to interpretability research, using them to guide sparse autoencoder training for Claude 3 Sonnet.30 The organization used scaling laws to extrapolate from smaller experiments to production-grade models, successfully extracting interpretable features from a frontier model.11

Meta

Meta's Llama research program has explored scaling beyond Chinchilla-optimal training, finding that models trained on 15 trillion tokens continued improving log-linearly.12 The research also developed scaling laws for data mix selection. Separately, Meta reported training throughput of over 400 TFLOPS per GPU when training on 16,000 GPUs simultaneously.

Open Questions and Future Directions

Several fundamental questions remain about scaling laws:

  1. Scaling limits: Whether power-law relationships continue indefinitely or break down at larger scales
  2. Architecture dependence: How different architectures (transformers, state space models, etc.) exhibit different scaling properties
  3. Synthetic data: How scaling laws apply when training on model-generated data rather than human-produced data
  4. Multimodal scaling: Optimal resource allocation across different modalities (text, image, video, audio)
  5. Capability prediction: Whether specific capabilities (coding, reasoning, factual recall) scale predictably or exhibit emergence

The field continues to refine understanding of these relationships, with implications for AI development strategy and capability forecasting.

Footnotes

  1. EA Forum community (2024). The Scaling Paradox.

  2. Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.

  3. Vet, J. (2024). A brief history of LLM Scaling Laws and what to expect in 2025.

  4. Brenndoerfer, M. (2024). GPT-3: Scale, Few-Shot Learning & In-Context Learning Discovery.

  5. Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556.

  6. Sun, P. (2024). AI Scaling Laws Explained.

  7. Brenndoerfer, M. (2024). Chinchilla Scaling Laws: Compute-Optimal Training.

  8. DeepMind (2022). An empirical analysis of compute-optimal large language model training.

  9. Wikipedia contributors (2024). Neural scaling law.

  10. Bahri, Y., et al. (2024). Explaining neural scaling laws. PNAS.

  11. Anthropic (2024). Mapping the Mind of a Large Language Model.

  12. Meta AI (2024). Introducing Meta Llama 3: The most capable openly available LLM to date.

  13. Wei, J., et al. (2022). Emergent Abilities of Large Language Models. arXiv:2206.07682.

  14. Schaeffer, R., et al. (2023). Are Emergent Abilities of Large Language Models a Mirage?. arXiv:2304.15004.

  15. Caballero, E., et al. (2022). Broken Neural Scaling Laws. arXiv:2210.14891.

  16. OpenAI (2019). Deep double descent.

  17. Bahri, Y., et al. (2021). Explaining Neural Scaling Laws. arXiv:2102.06701.

  18. OpenAI (2024). Learning to reason with LLMs.

  19. EA Forum community (2024). Inference Scaling and the Log-x Chart.

  20. Hugging Face (2024). What is test-time compute and how to scale it?

  21. Various authors (2024). Scaling Laws for Pre-training Agents and World Models.

  22. Various authors (2024). Revisiting Neural Scaling Laws in Language and Vision.

  23. Various authors (2025). Towards a Comprehensive Scaling Law of Mixture-of-Experts.

  24. Various authors (2024). An Empirical Study of Scaling Laws for Transfer.

  25. Various authors (2024). Scaling Laws for Downstream Task Performance of Large Language Models.

  26. Various authors (2024). Scaling Laws for Data Filtering: Data Curation cannot be Compute Agnostic.

  27. Various authors (2024). Neural Scaling Laws Rooted in the Data Distribution.

  28. Epoch AI (2024). Inference economics of language models.

  29. OpenAI (2020). Scaling laws for neural language models.

  30. Anthropic (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
