Longterm Wiki

Data movement bottlenecks to large-scale model training

Source type: web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Epoch AI

Relevant to AI safety timelines and compute governance discussions, as it identifies near-term physical constraints on AI scaling that could affect projections for when transformative AI systems might be developed.

Metadata

Importance: 72/100 | blog post | analysis

Summary

This Epoch AI paper analyzes fundamental data movement constraints that limit LLM training scaling, finding that training runs beyond ~2e28 FLOP become inefficient with current hardware due to data movement time dominating arithmetic time. A 'latency wall' is projected around 2e31 FLOP, potentially reachable within ~3 years. Aggressive batch size scaling is identified as a possible mitigation strategy.

Key Points

  • Training runs beyond 2e28 FLOP are projected to be infeasible with current technology due to data movement bottlenecks, assuming a 3-month maximum training duration.
  • A harder 'latency wall' at 2e31 FLOP represents a more fundamental limit where even architectural changes may not suffice.
  • Frontier AI training compute has grown 4-5x per year from 2010-2024, far outpacing Moore's law, making these limits potentially reachable in ~3 years.
  • Scaling training clusters (more GPUs) is the primary path for compute scaling but creates proportionally greater data movement demands both intra- and inter-GPU.
  • Aggressive batch size scaling is proposed as a potential approach to overcome data movement bottlenecks and extend the scaling frontier.
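The "~3 years" figure follows from simple growth-rate arithmetic. As a rough sketch (the ~100x gap is taken from the paper's Figure 1; the calculation itself is illustrative, not from the paper):

```python
import math

def years_until_limit(gap_factor: float, annual_growth: float) -> float:
    """Years of compute growth needed to close a multiplicative gap,
    assuming a constant annual growth factor."""
    return math.log(gap_factor) / math.log(annual_growth)

# Frontier training runs are roughly 100x below the ~2e28 FLOP efficiency
# limit, and compute grows 4-5x per year (Epoch AI's 2010-2024 estimate).
for growth in (4.0, 5.0):
    print(f"{growth}x/year: {years_until_limit(100, growth):.1f} years")
# 4x/year gives ~3.3 years; 5x/year gives ~2.9 years.
```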

Cited by 1 page

Page | Type | Quality
AI Timelines | Concept | 95.0

Cached Content Preview

HTTP 200 | Fetched Apr 9, 2026 | 23 KB
Data movement bottlenecks to large-scale model training: Scaling past 1e28 FLOP | Epoch AI 

 Introduction

 Over the past five years, the performance of large language models (LLMs) has improved dramatically, driven largely by rapid scaling of training compute budgets to accommodate larger models and training datasets. Our own estimates suggest that the training compute used by frontier AI models has grown by 4-5 times every year from 2010 to 2024. This pace far outpaces Moore’s law, and sustaining it has required scaling along three dimensions: making training runs last longer, increasing the number of GPUs participating in each training run, and using more performant GPUs.

 It’s relatively easy to scale the duration for which a GPU cluster is used to train a model. In practice, however, training runs rarely exceed 6 months. This is because both the hardware and software used for a training run risk becoming obsolete on longer timescales, and no lab wants to release a model that is outdated the moment it ships. This sets a practical limit on how long training runs can be.

 The alternative is to scale the size of training clusters, and this has been the primary way in which training compute scaling has been achieved. As an example, the original Transformer from Vaswani et al. (2017) was trained on 8 GPUs for 12 hours, while Llama 3.1 405B was trained on around 16,000 GPUs for around 2 months.
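The gap between those two runs can be put in common units. A rough sketch in GPU-hours, using the figures from the text (the "~2 months" is approximated as 60 days):

```python
# Original Transformer (Vaswani et al., 2017): 8 GPUs for 12 hours.
transformer_gpu_hours = 8 * 12

# Llama 3.1 405B: ~16,000 GPUs for ~2 months (approximated as 60 days).
llama_gpu_hours = 16_000 * 60 * 24

print(transformer_gpu_hours)                     # 96 GPU-hours
print(llama_gpu_hours)                           # 23,040,000 GPU-hours
print(llama_gpu_hours / transformer_gpu_hours)   # ~240,000x more GPU-time
```

Note this understates the compute gap further, since each modern GPU also delivers far more FLOP/s than a 2017-era one.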

 Scaling the sizes of training clusters, however, is not as easy as scaling the length of training runs: there are fundamental constraints, arising from data movement, on how many GPUs can effectively participate in a training run of bounded duration. All else equal, the more GPUs participate in a single training run, the more data must be moved around per second, both within each individual GPU and across different GPUs.

 In a recent paper by Epoch AI researcher Ege Erdil and collaborator David Schneider-Joseph, we analyze the point at which data movement bottlenecks are likely to preclude further scaling of training runs with present technology. According to the estimates in our paper, if we assume a maximum duration of 3 months, training runs past a scale of \(2 \times 10^{28} \text{ FLOP}\) are infeasible to do efficiently. This is because, past this training compute scale, the time taken for data movement begins to dominate the time taken for arithmetic, making it impossible to use the hardware efficiently.
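To get a feel for the cluster sizes involved at that scale, here is a back-of-the-envelope sketch. The hardware figures are assumptions for illustration (order-of-magnitude H100-class throughput and a typical utilization rate), not numbers from the paper:

```python
# How many GPUs would a 2e28 FLOP run need in 3 months?
total_flop = 2e28
duration_s = 90 * 86_400            # ~3 months in seconds

# Assumed figures (illustrative, not from the paper):
peak_flops_per_gpu = 1e15           # ~order of an H100's dense BF16 throughput
utilization = 0.4                   # typical utilization for large runs

required_flops = total_flop / duration_s
gpus_needed = required_flops / (peak_flops_per_gpu * utilization)
print(f"{required_flops:.2e} FLOP/s aggregate -> ~{gpus_needed:,.0f} GPUs")
# On these assumptions, millions of GPUs — all of which must exchange
# activations and gradients within the fixed 3-month window.
```

The point of the exercise: the required aggregate throughput is fixed by the compute budget and the deadline, so every shortfall in per-GPU efficiency must be made up with more GPUs, which in turn increases the data movement burden.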


 Figure 1. Training compute for top ML models is nearing the limits of current GPU technology, with data movement bottlenecks presenting a challenge. Current training runs are about 100x away from the limit beyond which scaling will face substantially reduced utilization. (Note that despite the H100’s higher arithmetic throughput, its maximum efficient training scale is actually lower than the A100’s due t

... (truncated, 23 KB total)
Resource ID: 5c30eb3fdb1f6437 | Stable ID: ZDFhZjk3Mj