Longterm Wiki

Kaplan et al. (2020)

paper

Authors

Jared Kaplan · Sam McCandlish · Tom Henighan · Tom B. Brown · Benjamin Chess · Rewon Child · Scott Gray · Alec Radford · Jeffrey Wu · Dario Amodei

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Data Status

Not fetched

Abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
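As a rough illustration of the power-law form the abstract describes (not the authors' code or data), the sketch below fits a relation of the shape L(N) = (N_c / N)^α_N to a handful of made-up (model size, loss) points via a log-log linear fit. The data points and fitted constants here are placeholders, not the values reported in the paper.

```python
# Illustrative sketch of the power-law form described in the abstract:
# L(N) ≈ (N_c / N)^alpha_N, i.e. log L is linear in log N.
# The data points below are made up for demonstration; they are NOT
# measurements from the paper.
import numpy as np

# Hypothetical (parameter count, test loss) pairs.
model_sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
losses      = np.array([5.8, 4.9, 4.1, 3.5, 2.9])

# Fit log L = slope * log N + intercept.
# With L = (N_c / N)^alpha, slope = -alpha and N_c = exp(intercept / alpha).
slope, intercept = np.polyfit(np.log(model_sizes), np.log(losses), 1)
alpha_N = -slope
N_c = np.exp(intercept / alpha_N)

print(f"fitted alpha_N ≈ {alpha_N:.3f}, N_c ≈ {N_c:.3e}")

def predicted_loss(n_params: float) -> float:
    """Extrapolate the fitted power law to a new model size."""
    return (N_c / n_params) ** alpha_N

print(f"predicted loss at 1e11 params ≈ {predicted_loss(1e11):.2f}")
```

On a log-log plot such a relationship is a straight line, which is how scaling-law results are typically visualized; the paper's headline claim is that these trends hold over many orders of magnitude in model size, dataset size, and compute.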

Cited by 8 pages

Cached Content Preview

HTTP 200 · Fetched Feb 22, 2026 · 5 KB
[2001.08361] Scaling Laws for Neural Language Models

 
 
Computer Science > Machine Learning

arXiv:2001.08361 (cs)

[Submitted on 23 Jan 2020]
Title: Scaling Laws for Neural Language Models
Authors: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

Abstract: We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
 

 
 
 
Comments: 19 pages, 15 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as: arXiv:2001.08361 [cs.LG] (or arXiv:2001.08361v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2001.08361
Submission history
From: Samuel McCandlish
[v1] Thu, 23 Jan 2020 03:59:20 UTC (1,520 KB)

 
 
 
 
 

... (truncated, 5 KB total)
Resource ID: 85f66a6419d173a7 | Stable ID: YTM5OGYyMG