AI Scaling Laws
Empirical relationships between compute, data, parameters, and AI performance
AI Scaling Laws describe empirical power-law relationships that govern how neural network performance improves with increases in model size, dataset size, and computational resources. These relationships have become foundational to modern AI development, enabling researchers to predict model capabilities and optimize resource allocation during training.
Overview
Scaling laws characterize how test loss decreases as key factors increase. The relationships follow predictable mathematical patterns across multiple orders of magnitude, allowing organizations to forecast performance improvements and make strategic decisions about model architecture and training approaches.
The discovery of these patterns in 2020 fundamentally changed AI development strategy, leading labs to prioritize scaling as a primary path to capability improvements. Dario Amodei and Ilya Sutskever have stated that the discovery of scaling laws led them to pursue the current large language model paradigm.1
Key Variables
Scaling laws examine the relationship between model performance (typically measured by test loss) and three primary factors:
- Model size (N): Number of parameters in the neural network
- Dataset size (D): Number of training tokens or examples
- Compute budget (C): Total computational resources (FLOPs) used for training
Kaplan Scaling Laws (2020)
In January 2020, researchers at OpenAI published foundational work demonstrating that loss scales as a power-law with model size, dataset size, and training compute across more than seven orders of magnitude.2
Mathematical Formulation
The Kaplan et al. results are commonly summarized by an additive power-law form for the test loss:
L = A/N^α + B/D^β + L₀
Where:
- L is the test loss
- N is the number of model parameters
- D is the dataset size (number of tokens)
- α, β are scaling exponents
- A, B, L₀ are constants
The research found that when not bottlenecked by the other factors, each variable exhibits power-law behavior independently.2
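As a numerical illustration, the additive form can be evaluated directly. The sketch below uses the Chinchilla-fitted exponents quoted later in this article (α ≈ 0.34, β ≈ 0.28); the constants A, B, and L₀ are placeholder values chosen for illustration, not fitted results from either paper:

```python
# Hypothetical evaluation of the additive scaling-law loss L = A/N^alpha + B/D^beta + L0.
# A, B, and L0 are illustrative placeholders; alpha and beta follow the Chinchilla
# values cited elsewhere in this article.

def scaling_loss(N, D, A=400.0, alpha=0.34, B=400.0, beta=0.28, L0=1.7):
    """Predicted test loss for N parameters and D training tokens."""
    return A / N**alpha + B / D**beta + L0

# Growing one factor alone bottoms out at the floor set by the other term plus L0;
# growing both together keeps lowering the predicted loss.
small = scaling_loss(N=1e8, D=1e10)
large = scaling_loss(N=1e10, D=1e12)
assert large < small
```

Because the terms are additive, scaling N while holding D fixed (or vice versa) eventually stops helping, which is the "bottlenecked by the other factor" behavior described above.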
Key Findings
The 2020 scaling laws revealed several important patterns:2
- Larger models are significantly more sample-efficient, achieving lower loss with fewer training tokens
- For optimal compute-efficient training, models should be trained on relatively modest data and stopped before convergence
- Performance depends strongly on scale but weakly on model shape (depth vs. width)
- The critical batch size approximately doubles for every 13% decrease in loss
Compute Allocation Strategy
The original Kaplan scaling laws suggested that as the pre-training compute budget increases, model size should scale faster than data. Specifically, with a 10x increase in training budget, the optimal strategy was to scale model size by 5.5x and data by 1.8x.3
This approach influenced GPT-3's development: the model had 175 billion parameters and was trained on approximately 300 billion tokens, a ratio of only about 1.7 tokens per parameter.3 The training required approximately 3.15 × 10²³ FLOPs.4
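The quoted multipliers follow directly from the Kaplan allocation exponents (a = 0.73 for parameters, b = 0.27 for data, discussed in the Chinchilla comparison below). A quick arithmetic check:

```python
# Sanity check on the Kaplan allocation rule: with exponents a=0.73 (parameters)
# and b=0.27 (data), a 10x compute increase implies roughly 5.4x more parameters
# and 1.9x more data, matching the ~5.5x / ~1.8x figures quoted above.
a, b = 0.73, 0.27
compute_multiplier = 10
param_growth = compute_multiplier ** a  # ~5.4x
data_growth = compute_multiplier ** b   # ~1.9x

# Because a + b = 1, the two growth factors multiply back to the compute multiplier.
assert abs(param_growth * data_growth - compute_multiplier) < 1e-9
```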
Chinchilla Scaling Laws (2022)
In March 2022, Google DeepMind researchers published revised scaling laws that fundamentally changed understanding of optimal training strategies.5
Revised Compute-Optimal Training
The Chinchilla paper trained over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens. The research found that for compute-optimal training, model size and training tokens should be scaled equally—for every doubling of model size, training tokens should also double.5
This contrasted sharply with the Kaplan scaling laws. Where Kaplan suggested scaling parameters with exponent a = 0.73 and data with exponent b = 0.27, Chinchilla analysis showed both exponents should be approximately 0.5.6
Mathematical Relationship
Under Chinchilla scaling laws, both optimal parameters and tokens grow as the square root of compute:7
- N_opt ∝ C^0.5
- D_opt ∝ C^0.5
This yields an optimal ratio of approximately 20 tokens per parameter for compute-optimal performance.5
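Combining the square-root relationships with the widely used C ≈ 6ND FLOPs estimate for transformer training (an approximation from the scaling-law literature, not stated explicitly in this article) gives a simple sizing sketch:

```python
import math

# Compute-optimal sizing sketch under Chinchilla scaling, assuming the common
# C ~ 6*N*D FLOPs estimate and the ~20 tokens-per-parameter ratio from the text.

def chinchilla_optimal(C, tokens_per_param=20.0):
    """Return (parameters, tokens) that split compute C (FLOPs) Chinchilla-style."""
    # With C = 6*N*D and D = r*N:  N = sqrt(C / (6*r)),  D = r*N.
    N = math.sqrt(C / (6.0 * tokens_per_param))
    return N, tokens_per_param * N

# Chinchilla itself: ~5.9e23 FLOPs yields roughly 70B parameters and 1.4T tokens.
N, D = chinchilla_optimal(5.9e23)
```

Both outputs grow as the square root of C, so quadrupling compute doubles the optimal model size and the optimal token count, consistent with N_opt ∝ C^0.5 and D_opt ∝ C^0.5.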
Chinchilla Model Performance
DeepMind demonstrated the revised scaling laws by training Chinchilla, a 70-billion-parameter model, on 1.4 trillion tokens (20 tokens per parameter). Despite using the same compute budget as their 280-billion-parameter Gopher model, Chinchilla achieved superior performance:5
- 67.5% accuracy on MMLU benchmark (compared to Gopher's 60.0%)
- Uniformly outperformed Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) across evaluated tasks
The research revealed that previous models like GPT-3 were significantly undertrained relative to compute-optimal standards.7
Implications for Training Strategy
The Chinchilla findings suggested that for Gopher's training budget, a 4x smaller model trained on 4x more data would have been preferable.8 This insight led the AI industry to reconsider the balance between model size and training duration.
Power Law Relationships
Mathematical Structure
Neural scaling laws exhibit power-law relationships, which are scale-invariant: rescaling the input rescales the output by a constant factor, so f(cx) ∝ f(x).9 The general form of the loss function incorporates terms for both model size and dataset size:9
L = A/N^α + B/D^β + L₀
Where scaling exponents typically fall in ranges:
- α ∈ [0.3, 0.5]
- β ≈ 0.6
For Chinchilla specifically, α ≈ 0.34 and β ≈ 0.28.9
Four Scaling Regimes
Recent theoretical work has identified four distinct scaling regimes:10
- Variance-limited (data-bottlenecked): loss falls predictably as the training set grows, with dataset size the binding constraint
- Resolution-limited (data-bottlenecked): loss is limited by how finely the available data samples the underlying distribution
- Variance-limited (model-bottlenecked): loss falls predictably as model size grows, with model capacity the binding constraint
- Resolution-limited (model-bottlenecked): loss is limited by the model's ability to resolve fine structure in the data
Understanding these regimes helps predict when scaling different factors will yield improvements.
Practical Applications
Compute Budget Allocation
Organizations use scaling laws to determine optimal resource allocation. For a given computational budget, the compute-optimal strategy divides resources approximately equally between model parameters and training tokens.7
This has practical implications for training decisions:
- A model trained to 20 tokens per parameter is likely compute-optimal
- Training beyond this ratio yields diminishing returns per FLOP
- Undertraining (fewer tokens per parameter) wastes potential model capacity
Predictive Capability
Scaling laws enable prediction of model performance before expensive training runs. Researchers can:
- Extrapolate performance from smaller pilot experiments
- Estimate capabilities of future models given projected compute availability
- Make informed decisions about architecture choices and training duration
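One way to realize this extrapolation is to fit a power law to pilot-run results by least squares in log-log space, where a power law is a straight line. The sketch below uses synthetic pilot data; the functional form L = A·C^(−m) and all numbers are illustrative, not values from any cited study:

```python
import math

# Sketch of performance extrapolation from pilot runs: fit L = A * C**(-m)
# by ordinary least squares on (log C, log L). Pilot points are synthetic.

def fit_power_law(points):
    """points: [(compute, loss), ...] -> (A, m) for L = A * C**(-m)."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(l) for _, l in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope

# Synthetic pilot runs generated from L = 50 * C**-0.05:
pilot = [(c, 50 * c ** -0.05) for c in (1e18, 1e19, 1e20)]
A, m = fit_power_law(pilot)

# Extrapolate four orders of magnitude beyond the largest pilot run.
predicted = A * (1e24) ** -m
```

In practice the risk is exactly the one discussed under "Limitations and Breakdown Points": the fit is only trustworthy if the power law holds across the extrapolated range.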
Anthropic used this predictive capability when scaling sparse autoencoders to production models, tuning methods at affordable scale before deploying on Claude Sonnet.11
Model Design Decisions
Meta's development of LLaMA 3 yielded new observations on scaling behavior. Models trained on 15 trillion tokens continued improving log-linearly, demonstrating performance gains well beyond Chinchilla-optimal token counts. The research also developed scaling laws for optimal data mix selection.12
Emergent Abilities and Scale
Definition and Examples
Wei et al. (2022) documented emergent abilities—capabilities not present in smaller models that appear in larger ones.13 These abilities manifest unpredictably at certain scale thresholds.
Key findings:
- Chain of thought prompting only surpasses standard prompting when scaled to approximately 10²³ training FLOPs (roughly 100 billion parameters)
- Specialized prompting or finetuning methods can be emergent, showing no positive effects until certain model scales
- Abilities appear suddenly rather than gradually as model size increases
Measurement Controversy
Schaeffer et al. (2023) challenged the emergent abilities framework, arguing that apparent emergence results from metric choice rather than fundamental model behavior changes.14
Their analysis found:
- Nonlinear or discontinuous metrics (like exact match accuracy) produce apparent emergent abilities
- Linear or continuous metrics (like token edit distance) reveal smooth, predictable improvements
- Of 29 examined metrics, 25 showed no emergent properties and revealed continuous, linear growth with scale
This suggests that emergence may partially reflect measurement artifacts rather than fundamental capability discontinuities.
Limitations and Breakdown Points
Broken Power Laws
Caballero et al. (2022) proposed "smoothly broken power laws" to better model complex scaling behaviors.15 This approach:
- Captures phenomena like double descent that traditional power laws cannot express
- Models scaling as linear segments connected by smooth breaks on log-log plots
- Yields extrapolations with RMSE 0.86 times that of traditional power law methods
Double Descent
OpenAI documented the double descent phenomenon in 2019: as parameters increase, test error initially decreases, increases (overfitting), then undergoes a second descent.16 This contradicts classical assumptions about overfitting and reveals non-monotonic scaling behavior.
Hierarchy Loss
Bahri et al. (2021) observed breakdown of predicted scaling behavior when model parameters become very large and structural hierarchy is lost.17 This suggests fundamental limits may exist to simple power-law extrapolation.
Test-Time Compute Scaling
Inference-Time Reasoning
OpenAI's o1 model introduced a new scaling dimension: performance improvements from increased test-time compute. The model's accuracy on AIME 2024 demonstrates this effect:18
- Single sample: 74% accuracy
- Consensus among 64 samples: 83% accuracy
- Re-ranking 1,000 samples: 93% accuracy
This represents a shift from improving model training to improving inference-time computation through extended reasoning processes.19
Test-Time Compute Methods
Models like o1 dynamically increase reasoning time during inference, spending more time on complex questions. This improves accuracy at the cost of higher computational expense. Implementations use step-by-step chain-of-thought reasoning to explore solution spaces more thoroughly.20
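The "consensus among N samples" numbers above can be made concrete with a toy majority-voting sketch: draw many answers from a stochastic model and return the most common one. The model here is a made-up stub that answers correctly 60% of the time, not a real LLM or OpenAI's actual re-ranking method:

```python
import random
from collections import Counter

# Toy sketch of consensus-based test-time scaling: sample N answers and
# take a majority vote. toy_model is a hypothetical stub, not a real LLM.

def toy_model(rng):
    """Answer "42" correctly 60% of the time, else a nearby wrong answer."""
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "44"])

def consensus_answer(sample_fn, rng, n_samples):
    """Spend more inference compute (n_samples) to get a more reliable answer."""
    votes = Counter(sample_fn(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# A single sample is wrong 40% of the time; with 64 samples the majority
# vote almost always recovers the modal (correct) answer.
answer = consensus_answer(toy_model, rng, n_samples=64)
```

The accuracy gain comes purely from extra inference compute: the underlying model is unchanged, mirroring the single-sample versus consensus-64 gap reported for o1.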
Scaling Beyond Language Models
World Models and Agents
Recent research demonstrates that power laws extend beyond language modeling to world models and imitation learning.21 Performance of embodied agents improves with increases in model parameters, dataset size, and compute across domains from robotics to video games.
However, coefficients are heavily influenced by tokenizer choice, task characteristics, and architecture, limiting direct transfer of language model scaling laws to other domains.21
Vision and Multimodal Models
Scaling law principles apply to image classification, neural machine translation, and other modalities, though specific exponents and relationships differ.22 Research provides methods for estimating scaling law parameters reliably from learning curves across diverse architecture families.
Mixture of Experts
Scaling laws for Mixture-of-Experts (MoE) architectures differ from dense transformers. Recent work suggests:23
- Optimal number of activated experts is approximately 7 for classical MoE settings
- Contrary to earlier results, MoE architectures can be more efficient than dense transformers regardless of model size
- Expert granularity significantly affects optimal configurations
Transfer Learning and Fine-tuning
Transfer Scaling Laws
Research on transfer learning reveals that pre-training data size and fine-tuning data size both influence downstream performance through distinct scaling relationships.24 The "transfer gap" measures the degree of transfer from pre-training to downstream distributions.
Measuring this gap from less expensive pre-training runs can predict improvements from more costly downstream training.24
Fine-tuning Dataset Size
The choice of pre-training data and its size affect downstream task performance. Distribution alignment between pre-training and downstream data significantly influences scaling behavior, with implications for how much fine-tuning data is needed to achieve target performance.25
Data Quality vs. Quantity
Compute-Dependent Filtering
Recent work challenges the assumption that data filtering decisions can be made independently of compute budget.26 Key findings:
- At low compute budgets, aggressively filtering for data quality produces better results
- At larger compute scales, the diminishing utility of limited high-quality data makes quantity increasingly important
- Optimal filtering strategies should account for the computational budget
Data Distribution Effects
Test error scaling depends on properties of the data distribution. Below a critical threshold, power-law distributions of subtask difficulties govern scaling. Above this threshold, a single dominant structure controls the relationship between error and scale.27
Inference Cost Scaling
Economic Implications
While training costs are incurred once, inference costs accumulate over a model's deployment lifetime. Epoch AI analysis indicates:28
- Inference latency scales with the square root of model size and cube root of memory bandwidth
- Cumulative inference costs can equal or exceed initial training costs for widely deployed models
- Inference revenue at major AI companies grows at 3x per year or more
This means compute-optimal training must consider not just training efficiency but also inference cost implications of model size choices.
Organizational Research Programs
OpenAI
OpenAI's 2020 scaling laws research directly influenced GPT-3's development and established the foundation for subsequent large language model development.29 The organization continues researching test-time compute scaling with the o1 series.18
Google DeepMind
Google DeepMind's Chinchilla research (2022) revised understanding of compute-optimal training, showing that model parameters and training tokens should scale equally.5 This work analyzed over 400 models and demonstrated substantial efficiency gains.
Anthropic
Anthropic has applied scaling law principles to interpretability research, using them to guide sparse autoencoder training for Claude 3 Sonnet.30 The organization used scaling laws to extrapolate from smaller experiments to production-grade models, successfully extracting interpretable features from a frontier model.11
Meta
Meta's LLaMA research program has explored scaling beyond Chinchilla-optimal training, finding that models trained on 15 trillion tokens continued improving log-linearly.12 The research also developed scaling laws for data mix selection, achieving over 400 TFLOPS per GPU on 16,000 GPUs simultaneously.
Open Questions and Future Directions
Several fundamental questions remain about scaling laws:
- Scaling limits: Whether power-law relationships continue indefinitely or break down at larger scales
- Architecture dependence: How different architectures (transformers, state space models, etc.) exhibit different scaling properties
- Synthetic data: How scaling laws apply when training on model-generated data rather than human-produced data
- Multimodal scaling: Optimal resource allocation across different modalities (text, image, video, audio)
- Capability prediction: Whether specific capabilities (coding, reasoning, factual recall) scale predictably or exhibit emergence
The field continues to refine understanding of these relationships, with implications for AI development strategy and capability forecasting.
Footnotes
1. EA Forum community (2024). The Scaling Paradox.
2. Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
3. Vet, J. (2024). A brief history of LLM Scaling Laws and what to expect in 2025.
4. Brenndoerfer, M. (2024). GPT-3: Scale, Few-Shot Learning & In-Context Learning Discovery.
5. Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556.
6. Sun, P. (2024). AI Scaling Laws Explained.
7. Brenndoerfer, M. (2024). Chinchilla Scaling Laws: Compute-Optimal Training.
8. DeepMind (2022). An empirical analysis of compute-optimal large language model training.
9. Wikipedia contributors (2024). Neural scaling law.
10. Bahri, Y., et al. (2024). Explaining neural scaling laws. PNAS.
11. Anthropic (2024). Mapping the Mind of a Large Language Model.
12. Meta AI (2024). Introducing Meta Llama 3: The most capable openly available LLM to date.
13. Wei, J., et al. (2022). Emergent Abilities of Large Language Models. arXiv:2206.07682.
14. Schaeffer, R., et al. (2023). Are Emergent Abilities of Large Language Models a Mirage? arXiv:2304.15004.
15. Caballero, E., et al. (2022). Broken Neural Scaling Laws. arXiv:2210.14891.
16. OpenAI (2019). Deep double descent.
17. Bahri, Y., et al. (2021). Explaining Neural Scaling Laws. arXiv:2102.06701.
18. OpenAI (2024). Learning to reason with LLMs.
19. EA Forum community (2024). Inference Scaling and the Log-x Chart.
20. Hugging Face (2024). What is test-time compute and how to scale it?
21. Various authors (2024). Scaling Laws for Pre-training Agents and World Models.
22. Various authors (2024). Revisiting Neural Scaling Laws in Language and Vision.
23. Various authors (2025). Towards a Comprehensive Scaling Law of Mixture-of-Experts.
24. Various authors (2024). An Empirical Study of Scaling Laws for Transfer.
25. Various authors (2024). Scaling Laws for Data Filtering — Data Curation cannot be Compute Agnostic.
26. Various authors (2024). Neural Scaling Laws Rooted in the Data Distribution.
27. Epoch AI (2024). Inference economics of language models.
28. OpenAI (2020). Scaling laws for neural language models.
29. Anthropic (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.