Longterm Wiki

Data movement bottlenecks to large-scale model training

web

Data Status: Not fetched

Cited by 1 page

Page | Type | Quality
AI Timelines | Concept | 95.0

Cached Content Preview

HTTP 200 | Fetched Feb 26, 2026 | 278 KB
Data movement bottlenecks to large-scale model training: Scaling past 1e28 FLOP | Epoch AI

Paper. Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a "latency wall" at 2e31 FLOP. We may hit these in ~3 years. Aggressive batch size scaling could potentially overcome these limits.

Published Nov 2, 2024
Authors: Ege Erdil
Resources: Paper

Introduction

Over the past five years, the performance of large language models (LLMs) has improved dramatically, driven largely by rapid scaling of training compute budgets to handle larger models and training datasets. Our own estimates suggest that the training compute used by frontier AI models has grown by 4-5 times every year from 2010 to 2024. This rapid pace of scaling far outpaces Moore's law, and sustaining it has required scaling along three dimensions: first, making training runs last longer; second, increasing the number of GPUs participating in each training run; and third, using more performant GPUs.

It's relatively easy to scale the duration for which a GPU cluster is used to train a model.[1] However, in practice training runs rarely exceed 6 months. This is because both the hardware and software used for a training run risk becoming obsolete at timescales longer than this, and no lab would want to release a model that is outdated immediately upon release. This sets a practical limit on how long training runs can become.

The alternative is to scale the size of training clusters, and this has been the primary way in which training compute scaling has been achieved. As an example, the original Transformer from Vaswani et al. (2017) was trained on 8 GPUs for 12 hours, while Llama 3.1 405B was trained on around 16,000 GPUs for around 2 months.

Scaling the size of training clusters, however, is not as easy as scaling the length of training runs: fundamental constraints arising from data movement limit how many GPUs can effectively participate in a training run of bounded duration. The more GPUs participate in a single training run, all else equal, the more data has to be moved around per second, both inside each individual GPU and across different GPUs.

In a recent paper by Epoch AI researcher Ege Erdil and collaborator David Schneider-Joseph, we analyze the point at which data movement bottlenecks are likely to preclude further scaling of training runs with pr

... (truncated, 278 KB total)
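A back-of-envelope extrapolation of the article's timeline claim, as a minimal sketch. The 4-5x annual growth rate and the 2e28 / 2e31 FLOP thresholds are taken from the article; the assumed starting point of ~4e25 FLOP (roughly the reported scale of Llama 3.1 405B) is an illustrative assumption, not a figure from the paper.

```python
import math

# Rough extrapolation of the article's "~3 years" claim. Assumptions (not from the paper):
#   - current frontier training run: ~4e25 FLOP (roughly the scale of Llama 3.1 405B)
#   - frontier training compute keeps growing 4-5x per year (the article's trend estimate)
START_FLOP = 4e25
GROWTH_RANGE = (4.0, 5.0)   # multiplicative growth per year

def years_to_reach(target_flop: float, start_flop: float, growth_per_year: float) -> float:
    """Years until exponential growth from start_flop reaches target_flop."""
    return math.log(target_flop / start_flop) / math.log(growth_per_year)

# 2e28 FLOP: data movement bottleneck; 2e31 FLOP: "latency wall" (both from the abstract).
for target in (2e28, 2e31):
    fast = years_to_reach(target, START_FLOP, GROWTH_RANGE[1])
    slow = years_to_reach(target, START_FLOP, GROWTH_RANGE[0])
    print(f"{target:.0e} FLOP reached in roughly {fast:.1f}-{slow:.1f} years on trend")
```

With these round numbers the 2e28 FLOP threshold lands roughly 4 years out; the result is sensitive to the assumed starting point, and a larger current frontier run brings it closer to the article's "~3 years".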
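The claim that adding GPUs raises data movement per unit of compute can be illustrated with a deliberately simplified toy model: pure data parallelism at a fixed global batch size, where every optimizer step ends in a gradient all-reduce. This is not the paper's analysis (which also treats tensor and pipeline parallelism, intra-GPU data movement, and latency); all hardware numbers below are assumed round figures.

```python
# Toy model (not the paper's analysis): pure data-parallel training with a fixed
# global batch. Each GPU computes on its shard, then joins a ring all-reduce of
# the gradients. All numbers are illustrative assumptions.
PARAMS = 405e9                # model parameters (Llama 3.1 405B scale)
GLOBAL_BATCH_TOKENS = 16e6    # tokens per optimizer step (assumed)
GPU_FLOPS = 4e14              # sustained FLOP/s per GPU (assumed)
ALLREDUCE_BYTES_PER_S = 5e10  # effective all-reduce bandwidth per GPU (assumed)
BYTES_PER_GRADIENT = 2        # 16-bit gradients (assumed)

def step_times(num_gpus: int) -> tuple[float, float]:
    """Per-GPU compute time and gradient all-reduce time for one training step."""
    tokens_per_gpu = GLOBAL_BATCH_TOKENS / num_gpus
    compute_s = 6 * PARAMS * tokens_per_gpu / GPU_FLOPS           # ~6 FLOP per parameter per token
    comm_s = 2 * PARAMS * BYTES_PER_GRADIENT / ALLREDUCE_BYTES_PER_S  # ring all-reduce volume per GPU
    return compute_s, comm_s

for gpus in (1_000, 4_000, 16_000, 64_000):
    compute_s, comm_s = step_times(gpus)
    verdict = "comm hidden behind compute" if comm_s <= compute_s else "comm dominates"
    print(f"{gpus:>6} GPUs: compute {compute_s:7.1f} s, all-reduce {comm_s:5.1f} s -> {verdict}")
```

At these assumed numbers, the per-GPU communication volume is fixed while per-GPU compute shrinks as 1/N, so beyond a few thousand data-parallel GPUs the all-reduce can no longer be hidden behind compute. Raising the global batch pushes that crossover out, which is one reason the article points to aggressive batch size scaling as a potential way past these limits.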
Resource ID: 5c30eb3fdb1f6437 | Stable ID: ZDFhZjk3Mj