Kaplan et al. (2020)
Paper Authors
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Data Status
Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
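For quick reference, the power laws described in the abstract take the following functional forms; the exponents and constants noted in the comments are approximate fitted values reported in the paper and depend on the specific training setup and tokenization.

```latex
% Loss vs. model size N (non-embedding parameters), dataset size D (tokens),
% and minimum compute C_min, each taken with the other factors unbounded:
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}

% Joint dependence on model and dataset size, which governs overfitting:
L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}

% Approximate fitted exponents reported in the paper:
% \alpha_N \approx 0.076, \quad \alpha_D \approx 0.095, \quad \alpha_C^{\min} \approx 0.050
```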
Cited by 8 pages
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
| Dense Transformers | Concept | 58.0 |
| Capability-Alignment Race Model | Analysis | 62.0 |
| Power-Seeking Emergence Conditions Model | Analysis | 63.0 |
| AI Scaling Laws | Concept | 92.0 |
| OpenAI | Organization | 62.0 |
| AI Proliferation | Risk | 60.0 |
| AI Winner-Take-All Dynamics | Risk | 54.0 |
Cached Content Preview
[2001.08361] Scaling Laws for Neural Language Models
arXiv:2001.08361 (cs)
[Submitted on 23 Jan 2020]
Title: Scaling Laws for Neural Language Models
Authors: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
Comments: 19 pages, 15 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as: arXiv:2001.08361 [cs.LG] (or arXiv:2001.08361v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2001.08361 (arXiv-issued DOI via DataCite)
Submission history
From: Samuel McCandlish
[v1] Thu, 23 Jan 2020 03:59:20 UTC (1,520 KB)
... (truncated, 5 KB total)