Will we run out of data? Limits of LLM scaling based on human-generated data
paper · arxiv.org/html/2211.04325v2
Data Status: Not fetched
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Timelines | Concept | 95.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 26, 2026 · 268 KB
Will we run out of data? Limits of LLM scaling based on human-generated data

Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Marius Hobbhahn

Abstract

We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.

Figure 1: Projections of the effective stock of human-generated public text and dataset sizes used to train notable LLMs. The intersection of the stock and dataset size projection lines indicates the median year (2028) in which the stock is expected to be fully utilized if current LLM development trends continue. At this point, models will be trained on dataset sizes approaching the total effective stock of text in the indexed web: around 4e14 tokens, corresponding to training compute of ~5e28 FLOP for non-overtrained models. Individual dots represent dataset sizes of specific notable models. The model is explained in Section 2.

1 Introduction

Recent progress in language modeling has relied heavily on unsupervised training on vast amounts of human-generated text, primarily sourced from the web or curated corpora (Zhao et al., 2023). The largest datasets of human-generated public text data, such as RefinedWeb, C4, and RedPajama, contain tens of trillions of words collected from billions of web pages (Penedo et al., 2023; Together.ai, 2023).

The demand for public human text data is likely to continue growing. To scale the size of models and training runs efficiently, large language models (LLMs) are typically trained according to neural scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). These relationships imply that increasing the size of training datasets is crucial for efficiently improving the performance of LLMs.

In this paper, we argue that human-generated public text data cannot sustain scaling beyond this decade. To support this conclusion, we develop a model of the growing demand for training data and the production of public human text data. We use this model to predict when the trajectory of LLM development will fully exhaust the available stock of public human text data. We then explore a range of potential strategies to circumvent this constraint, such as synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements.
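As a rough sanity check (not part of the cached paper text), the ~5e28 FLOP figure in the caption can be reproduced from the 4e14-token stock using two common approximations that are assumptions here, not the paper's own projection model from Section 2: the compute-optimal "Chinchilla" ratio of roughly 20 training tokens per parameter (Hoffmann et al., 2022) and the standard dense-transformer estimate of about 6·N·D training FLOP. A minimal sketch:

```python
# Back-of-the-envelope check of the Figure 1 compute estimate (illustrative only).
# Assumptions (not taken from the paper's own model):
#   - compute-optimal token-to-parameter ratio D ~ 20 * N (Hoffmann et al., 2022)
#   - training compute C ~ 6 * N * D FLOP (standard dense-transformer approximation)

stock_tokens = 4e14                       # effective stock of indexed-web text (Figure 1)

tokens_per_param = 20                     # assumed compute-optimal ratio
params = stock_tokens / tokens_per_param  # ~2e13 parameters
compute_flop = 6 * params * stock_tokens  # ~4.8e28 FLOP

print(f"model size ~ {params:.1e} parameters")
print(f"compute    ~ {compute_flop:.1e} FLOP")  # ~5e28 FLOP, matching the caption

# An overtrained model uses more tokens per parameter (D > 20 * N), so it reaches
# the same data stock at a lower compute budget, i.e. slightly earlier.
```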
... (truncated, 268 KB total)
Resource ID: 5dee98c481614176 | Stable ID: ODA3YmJmN2