Will we run out of data? Limits of LLM scaling based on human-generated data
paper · arxiv.org/html/2211.04325v2
Data Status: Not fetched
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Timelines | Concept | 95.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 26, 2026 · 268 KB
Will we run out of data? Limits of LLM scaling based on human-generated data

Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Marius Hobbhahn

Abstract

We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.

Figure 1: Projections of the effective stock of human-generated public text and dataset sizes used to train notable LLMs. The intersection of the stock and dataset size projection lines indicates the median year (2028) in which the stock is expected to be fully utilized if current LLM development trends continue. At this point, models will be trained on dataset sizes approaching the total effective stock of text in the indexed web: around 4e14 tokens, corresponding to training compute of ~5e28 FLOP for non-overtrained models. Individual dots represent dataset sizes of specific notable models. The model is explained in Section 2.

1 Introduction

Recent progress in language modeling has relied heavily on unsupervised training on vast amounts of human-generated text, primarily sourced from the web or curated corpora (Zhao et al., 2023). The largest datasets of human-generated public text data, such as RefinedWeb, C4, and RedPajama, contain tens of trillions of words collected from billions of web pages (Penedo et al., 2023; Together.ai, 2023).

The demand for public human text data is likely to continue growing. To scale the size of models and training runs efficiently, large language models (LLMs) are typically trained according to neural scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). These relationships imply that increasing the size of training datasets is crucial for efficiently improving the performance of LLMs.

In this paper, we argue that human-generated public text data cannot sustain scaling beyond this decade. To support this conclusion, we develop a model of the growing demand for training data and the production of public human text data. We use this model to predict when the trajectory of LLM development will fully exhaust the available stock of public human text data. We then explore a range of potential strategies to circumvent this constraint, such as synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements.
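As a rough sanity check (not part of the cached paper text), the ~5e28 FLOP figure in the caption can be reproduced from the 4e14-token stock using two common approximations that are assumptions here, not the paper's own projection model from Section 2: the compute-optimal "Chinchilla" ratio of roughly 20 training tokens per parameter (Hoffmann et al., 2022) and the standard dense-transformer estimate of about 6·N·D training FLOP. A minimal sketch:

```python
# Back-of-the-envelope check of the Figure 1 compute estimate (illustrative only).
# Assumptions (not taken from the paper's own model):
#   - compute-optimal token-to-parameter ratio D ~ 20 * N (Hoffmann et al., 2022)
#   - training compute C ~ 6 * N * D FLOP (standard dense-transformer approximation)

stock_tokens = 4e14                       # effective stock of indexed-web text (Figure 1)

tokens_per_param = 20                     # assumed compute-optimal ratio
params = stock_tokens / tokens_per_param  # ~2e13 parameters
compute_flop = 6 * params * stock_tokens  # ~4.8e28 FLOP

print(f"model size ~ {params:.1e} parameters")
print(f"compute    ~ {compute_flop:.1e} FLOP")  # ~5e28 FLOP, matching the caption

# An overtrained model uses more tokens per parameter (D > 20 * N), so it reaches
# the same data stock at a lower compute budget, i.e. slightly earlier.
```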
... (truncated, 268 KB total)
Resource ID: 5dee98c481614176 | Stable ID: ODA3YmJmN2