
Compute Thresholds

Approach: Define capability boundaries via compute
Status: Established in US and EU policy

Compute thresholds represent one of the most concrete regulatory approaches to AI governance implemented to date, using training compute as a measurable trigger for safety and transparency requirements. Unlike export controls that restrict access or monitoring systems that provide ongoing visibility, thresholds create a simple binary rule: if you train a model above X floating-point operations (FLOP), you must comply with specific regulatory obligations.

This approach has gained traction because compute is both measurable and correlates with model capabilities, albeit imperfectly. The European Union’s AI Act established a 10^25 FLOP threshold in 2024, while the US Executive Order on AI set a 10^26 FLOP trigger in October 2023. These implementations represent the first large-scale attempt to regulate AI development based on resource consumption rather than demonstrated capabilities or actor identity.

However, compute thresholds face a fundamental challenge: algorithmic efficiency improvements of approximately 2x per year are decoupling compute requirements from capabilities. A model requiring 10^25 FLOP in 2023 might achieve equivalent performance with only 10^24 FLOP by 2026, potentially making static thresholds obsolete within 3-5 years. This creates an ongoing tension between the tractability of compute-based triggers and their diminishing relevance as a proxy for AI capabilities and associated risks.
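
To make the erosion concrete, the sketch below projects the compute needed to match a fixed 2023 capability level under the roughly 2x-per-year efficiency trend cited above. The doubling rate and baseline are the article's round numbers, not precise measurements.

```python
# Sketch: erosion of a static compute threshold under algorithmic efficiency gains.
# Assumes the ~2x/year efficiency improvement cited above; real rates vary by domain.

EFFICIENCY_DOUBLING_YEARS = 1.0       # compute for a fixed capability halves every year
BASE_YEAR, BASE_COMPUTE = 2023, 1e25  # GPT-4-scale capability in 2023 (approx.)

def compute_needed(year: int) -> float:
    """FLOP needed in `year` to match the capability that took BASE_COMPUTE in BASE_YEAR."""
    halvings = (year - BASE_YEAR) / EFFICIENCY_DOUBLING_YEARS
    return BASE_COMPUTE / (2 ** halvings)

for year in range(2023, 2029):
    print(f"{year}: {compute_needed(year):.1e} FLOP for 2023's 1e25-FLOP capability")
# By 2026 the same capability needs ~1.3e24 FLOP -- below the EU's 1e25 trigger.
```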

| Dimension | Assessment | Evidence |
|---|---|---|
| Regulatory Adoption | High | EU AI Act (10^25 FLOP, effective Aug 2025), US EO 14110 (10^26 FLOP, active since Oct 2023) |
| Cost to Trigger | $7-100M+ | Training at 10^25 FLOP costs $7-10M; at 10^26 FLOP, $70-100M (Epoch AI) |
| Models Currently Captured | 5-15 globally | GPT-4, Gemini 1.5 Pro, Claude 3.7, Grok 3, Llama 3 (EU AI Office) |
| Algorithmic Efficiency Erosion | 2x every 8-17 months | Compute required for a given capability halves roughly every 8-17 months (OpenAI) |
| Inference Scaling Gap | Critical | Models below thresholds can achieve 10^3-10^5x capability gains at inference time (GovAI) |
| Evasion Difficulty | Low-Medium | Distillation, jurisdictional arbitrage, and fine-tuning can circumvent thresholds (Fenwick) |
| Threshold Shelf Life | 3-5 years | Static thresholds will capture 100-200 models by 2028 vs. intended frontier focus (GovAI) |

| Risk | Mechanism | Effectiveness |
|---|---|---|
| Racing Dynamics | Forces safety testing before deployment | Medium |
| Bioweapons | Lower thresholds for bio-sequence models | Medium |
| Deceptive Alignment | Requires evaluation before deployment | Low-Medium |

The following table compares compute threshold implementations across major jurisdictions, revealing significant variation in both threshold levels and triggered requirements:

| Jurisdiction | Threshold | Scope | Key Requirements | Status | Source |
|---|---|---|---|---|---|
| EU AI Act | 10^25 FLOP | GPAI with systemic risk | Transparency, risk evaluation, incident reporting, adversarial testing | Effective Aug 2025 | EC Guidelines |
| US EO 14110 | 10^26 FLOP | General AI systems | Pre-training notification, safety testing, security measures | Active (Oct 2023) | Commerce reporting |
| US EO 14110 | 10^23 FLOP | Biological sequence models | Same as above, lower threshold for bio-risk | Active (Oct 2023) | Commerce reporting |
| China Draft AI Law | Not yet specified | "Critical AI" systems | Assessment and approval before market deployment | Draft stage | Asia Society |
| UK AISI | Capability-based | Frontier models | Voluntary evaluation, no formal threshold | Monitoring only | AISI Framework |

The 1000x difference between the US biological threshold (10^23) and general threshold (10^26) reflects the assessment that biological capabilities may emerge at much smaller model scales. The EU's 10^25 threshold sits between these extremes, calibrated to capture approximately GPT-4-scale models while excluding smaller systems.
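
As a minimal illustration of how these regimes combine, the sketch below screens a hypothetical training run against the thresholds in the table; the constants come from the table, while the function and its bio-model flag are illustrative rather than part of any official tooling.

```python
# Sketch: screen a planned training run against the threshold regimes above.
# Threshold values are from the EU AI Act and US EO 14110 as described in this article.

THRESHOLDS = {
    "EU AI Act (GPAI systemic risk)": 1e25,
    "US EO 14110 (general AI)": 1e26,
    "US EO 14110 (biological sequence models)": 1e23,
}

def triggered_regimes(training_flop: float, bio_model: bool = False) -> list[str]:
    regimes = []
    for name, threshold in THRESHOLDS.items():
        if "biological" in name and not bio_model:
            continue  # the 1e23 trigger applies only to bio-sequence models
        if training_flop >= threshold:
            regimes.append(name)
    return regimes

print(triggered_regimes(4e25))                  # ['EU AI Act (GPAI systemic risk)']
print(triggered_regimes(5e23, bio_model=True))  # only the US bio threshold fires
```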

Estimated Training Costs by Threshold Level


Understanding the economic implications of compute thresholds requires examining the relationship between FLOP thresholds and actual training costs. Epoch AI research provides detailed cost breakdowns:

| Compute Level | Estimated Training Cost | Example Models | Regulatory Status | Cost Breakdown |
|---|---|---|---|---|
| 10^23 FLOP | $70K-$100K | GPT-3-scale, early LLMs | US EO bio-threshold trigger | Hardware: 47-67%, Staff: 29-49%, Energy: 2-6% |
| 10^24 FLOP | $700K-$1M | Llama 2-70B, Mistral | Below major thresholds | Accessible to well-funded startups |
| 10^25 FLOP | $7M-$10M | GPT-4, Claude 3 Opus | EU AI Act GPAI systemic risk | Requires major corporate backing |
| 10^26 FLOP | $70M-$100M | Projected next-gen frontier | US EO 14110 trigger | Only 10-15 organizations globally |
| 10^27 FLOP | $700M-$1B+ | Projected 2027-2028 frontier | Not yet reached | Grok-4 estimated at $480M (Epoch AI) |

Key insight: Training costs have grown by approximately 2.4x per year since 2020, suggesting frontier models will exceed $1 billion by 2027. This concentration of capability among well-capitalized organizations is itself a form of implicit access control—but compute thresholds provide transparency about which actors are operating at these scales.
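
A quick extrapolation shows why the $1 billion mark arrives by 2027 under this trend; the sketch below assumes a ~$100M frontier run in 2024 (the upper end of the Epoch estimate above) as its baseline.

```python
# Sketch: extrapolate frontier training cost at the ~2.4x/year growth rate cited above.
# Baseline: ~$100M for a 1e26-FLOP frontier run in 2024 (upper end of the Epoch estimate).

BASE_YEAR, BASE_COST = 2024, 100e6
GROWTH = 2.4  # cost multiplier per year

for year in range(2024, 2029):
    cost = BASE_COST * GROWTH ** (year - BASE_YEAR)
    print(f"{year}: ~${cost / 1e9:.2f}B")
# 2027: ~$1.4B -- consistent with frontier runs exceeding $1B by 2027.
```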

EU AI Act Foundation Models Regulation (2024)


The EU AI Act, which entered into force in August 2024, establishes the most comprehensive compute threshold regime to date. According to the European Commission’s guidelines, models trained with more than 10^25 FLOP are classified as General Purpose AI (GPAI) systems with systemic risk, triggering substantial obligations:

| Obligation Category | Specific Requirements | Compliance Deadline |
|---|---|---|
| Transparency | Training data documentation, model card publication | August 2025 |
| Risk Evaluation | Systemic risk assessment, adversarial testing | August 2025 |
| Incident Reporting | Mandatory reporting of safety incidents | August 2025 |
| Notification | Notify AI Office within 2 weeks of crossing threshold | Immediate |
| Downstream Modifiers | One-third threshold (10^24.5 FLOP) for fine-tuning systemic-risk models | August 2025 |

The 10^25 FLOP threshold was calibrated to capture models at roughly GPT-4’s training scale, which required approximately 2-5 × 10^25 FLOP based on available estimates. According to CSET Georgetown analysis, the Safety and Security chapter of the Code of Practice only applies to providers of GPAI models with systemic risk—currently a small group of 5-15 companies worldwide including providers of GPT-4, Gemini 1.5 Pro, Claude 3.7 Sonnet, and Grok 3.

The implementation timeline requires full compliance by August 2025, with providers placing GPAI models on the market before this date having until August 2027 to comply. Notably, providers can contest classification by demonstrating that despite surpassing the compute threshold, their model does not possess “high-impact capabilities” matching the most advanced models.

The United States took a different approach with Executive Order 14110, setting a higher threshold of 10^26 FLOP for general AI systems while establishing a much lower 10^23 FLOP threshold specifically for models trained primarily on biological sequence data. According to Stanford HAI analysis, researchers estimated that the 10^26 threshold is “more than any model trained to date” with GPT-4 just under this threshold.

| Requirement | Details | Trigger |
|---|---|---|
| Pre-Training Notification | Report ongoing or planned training activities to Commerce | Before training begins |
| Safety Testing | Conduct red-teaming; share results with government | Before and after deployment |
| Security Measures | Protect model weights and training infrastructure | Ongoing |
| Computing Cluster Reporting | Clusters above 10^20 OP/s with networking above 300 Gbit/s | Upon acquisition |

The dual-threshold approach reflects differentiated risk assessment, with the biological threshold set at roughly GPT-3 scale (10^23 FLOP) to capture potential bioweapon development risks at lower capability levels. According to Mayer Brown analysis, the Bureau of Industry and Security assesses that no more than 15 companies currently exceed the reporting thresholds for models and computing clusters.

Notable implementations include Meta’s reporting of Llama 3 training (estimated ~4 × 10^25 FLOP) and OpenAI’s compliance with pre-training notification requirements. The Department of Commerce has established preliminary reporting mechanisms, though as noted by the Institute for Law & AI, the executive order was among those revoked by President Trump upon entering office, creating uncertainty about US threshold policy continuity.

The UK has taken a more cautious approach, with the Frontier AI Taskforce (now AI Safety Institute) monitoring compute thresholds without establishing formal regulatory triggers. China’s approach remains opaque, though draft regulations suggest consideration of compute-based measures alongside capability assessments. The result is a fragmented global landscape where companies must navigate multiple threshold regimes with different requirements and measurement standards.

Compute thresholds operate through a multi-stage regulatory pipeline that begins before training commences. The typical sequence involves threshold definition by regulators, pre-training notification by AI developers, threshold crossing triggering specific requirements, mandatory evaluation and testing phases, implementation of required safeguards, and finally authorized deployment under ongoing monitoring.
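
A minimal sketch of that sequence, with stage names paraphrasing the description above rather than any statutory language:

```python
# Sketch: the threshold-triggered regulatory pipeline described above, as ordered stages.
# Stage names paraphrase this article's description; they are not statutory terms.

PIPELINE = [
    "threshold definition (regulator)",
    "pre-training notification (developer)",
    "threshold crossing -> requirements triggered",
    "mandatory evaluation and testing",
    "implementation of required safeguards",
    "authorized deployment under ongoing monitoring",
]

def next_stage(current: str) -> str | None:
    """Return the stage after `current`, or None once deployment is reached."""
    i = PIPELINE.index(current)
    return PIPELINE[i + 1] if i + 1 < len(PIPELINE) else None

print(next_stage("pre-training notification (developer)"))
# -> 'threshold crossing -> requirements triggered'
```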


This pipeline structure is designed to provide regulatory visibility into AI development before capabilities emerge, rather than reacting after deployment. However, implementation varies significantly between jurisdictions, with the EU emphasizing post-training compliance verification while the US focuses on pre-training notification and ongoing cooperation.

Pre-training requirements typically include notification of training intent, security measures for training infrastructure, and preliminary risk assessments. Pre-deployment obligations encompass comprehensive safety evaluations including red-teaming exercises, capability testing across multiple domains, detailed risk assessments, and extensive documentation of training processes and data sources.

Ongoing requirements extend throughout the model lifecycle, including incident reporting for safety failures or misuse, monitoring systems for detecting problematic applications, cooperation with regulatory investigations, and periodic compliance audits. The breadth of these requirements reflects the challenge of governing AI systems whose capabilities and risks may emerge or change after initial deployment.

The most fundamental challenge facing compute thresholds is the rapid improvement in algorithmic efficiency, which threatens to make static thresholds increasingly irrelevant. Research by Epoch AI documents that training compute of frontier AI models has grown by 4-5x per year since 2010, while OpenAI research found that since 2012, the compute required to train a neural network to ImageNet classification performance has been decreasing by 2x every 16 months. Between 2012 and 2019, improvements in image classification algorithms led to a 97.7% reduction in the compute required to match AlexNet’s performance.

More recent analysis suggests algorithmic efficiency progress may be accelerating. Research on inference costs estimates algorithmic efficiency improvements of approximately 3x per year once competition effects are isolated, while hardware price-performance has doubled approximately every two years since 2006.

| Trend | Rate | Implication for Thresholds |
|---|---|---|
| Frontier compute growth | 4-5x/year | More models will exceed thresholds |
| Hardware efficiency (FLOP/W) | 1.28x/year | Same compute costs less |
| Training cost growth | 2.4x/year | Frontier models now cost hundreds of millions USD |
| Capability improvement | ≈15 points/year (2024) | Nearly doubled from ≈8 points/year |
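
One way to see how these trends interact is to convert raw training compute into an efficiency-adjusted figure; the sketch below does this under the 8-17 month doubling range cited earlier. The adjustment formula is illustrative, not a proposed regulatory metric.

```python
# Sketch: efficiency-adjusted ("effective") compute under the 8-17 month
# capability-per-FLOP doubling range cited above. Illustrative, not a regulatory formula.

def effective_compute(raw_flop: float, years_after_base: float,
                      doubling_months: float) -> float:
    """Raw compute scaled by cumulative algorithmic efficiency gains since a base year."""
    doublings = years_after_base * 12 / doubling_months
    return raw_flop * 2 ** doublings

# A 1e24-FLOP run three years after the base year, under fast vs. slow efficiency gains:
for months in (8, 17):
    eff = effective_compute(1e24, years_after_base=3, doubling_months=months)
    print(f"doubling every {months} mo: 1e24 raw FLOP ~ {eff:.1e} effective FLOP")
# Fast gains (8 mo): ~2.3e25 -- above the EU's 1e25 line in capability terms.
```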

This creates a dual challenge: on one hand, as frontier compute keeps growing, today's thresholds will capture far more models than intended, straining regulatory capacity. On the other hand, if algorithmic efficiency improves faster than expected, equivalent capabilities could be achieved with 10-100x less compute, allowing dangerous models to evade oversight. The GovAI research on training compute thresholds explicitly notes that "training compute is an imperfect proxy for risk" and should be used to "detect potentially risky GPAI models that warrant regulatory oversight" rather than as a standalone regulatory mechanism.

The problem is compounded by the uneven nature of efficiency improvements, which vary significantly across model architectures and training paradigms. Language models, multimodal systems, and specialized scientific models each follow different efficiency trajectories, making it difficult to set universal thresholds that remain relevant across domains. The EU AI Act acknowledges this by including Article 51(3) provisions for the Commission to “amend the thresholds… in light of evolving technological developments, such as algorithmic improvements or increased hardware efficiency.”

Sophisticated actors have multiple strategies for evading compute thresholds while achieving equivalent model performance. The following table summarizes key evasion vectors identified in governance research:

| Evasion Strategy | Mechanism | Difficulty | Potential Countermeasure |
|---|---|---|---|
| Training run splitting | Multiple sub-threshold runs combined via fine-tuning or merging | Medium | Cumulative compute tracking across related runs |
| Model distillation | Train large teacher model privately, distill to smaller student | High | Teacher model reporting requirements |
| Jurisdictional arbitrage | Train in unregulated jurisdiction, deploy globally | Low | Deployment-based jurisdiction rules |
| Creative accounting | Exclude fine-tuning, inference, or multi-stage compute | Medium | Standardized compute definitions |
| Distributed training | Split training across jurisdictions/entities | Medium | Consolidated reporting requirements |
| Inference-time scaling | Use test-time compute instead of training compute | Low (emerging) | Include inference thresholds |

The distillation loophole is particularly concerning: as noted by governance researchers, “a company might use greater than 10^25 FLOPs to train a teacher model that is never marketed or used in the EU, then use that teacher model to train a smaller student model that is nearly as capable but trained using less than 10^25 FLOPs.” This allows regulatory evasion while achieving equivalent model performance.

International arbitrage allows organizations to conduct high-compute training in jurisdictions without established thresholds, then deploy globally. This creates competitive pressure for regulatory harmonization while potentially undermining the effectiveness of unilateral threshold implementations. The GovAI Know-Your-Customer proposal suggests that compute providers could help close these loopholes by identifying and reporting potentially problematic training runs.
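
The countermeasure listed above for run splitting, cumulative compute tracking, is conceptually straightforward; this sketch aggregates reported runs by model lineage before applying the threshold. Keying runs to a lineage identifier is an assumed reporting scheme, not an existing mechanism.

```python
# Sketch: cumulative compute tracking across related training runs, the countermeasure
# to run-splitting listed in the table above. Keying runs by model lineage is an
# assumed reporting scheme, not an existing regulatory mechanism.

from collections import defaultdict

EU_THRESHOLD = 1e25

def flag_lineages(reported_runs: list[tuple[str, float]]) -> set[str]:
    """Sum FLOP per lineage and flag lineages whose total crosses the threshold."""
    totals: dict[str, float] = defaultdict(float)
    for lineage, flop in reported_runs:
        totals[lineage] += flop
    return {lin for lin, total in totals.items() if total >= EU_THRESHOLD}

runs = [
    ("model-A", 6e24), ("model-A", 5e24),  # two sub-threshold runs, same lineage
    ("model-B", 9e24),                     # single sub-threshold run
]
print(flag_lineages(runs))  # {'model-A'}: 1.1e25 cumulative crosses the 1e25 line
```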


Current threshold regimes rely primarily on self-reporting by AI developers, creating significant verification challenges. While major companies have generally complied in good faith with existing requirements, the absence of technical verification mechanisms creates enforcement vulnerabilities. Hardware-level monitoring could provide more reliable compute measurement, but raises significant privacy and trade secret concerns for AI developers.

Definitional ambiguities compound measurement challenges, particularly around edge cases like multi-stage training, transfer learning, and inference-time computation. The emergence of techniques like chain-of-thought reasoning and test-time training blur traditional boundaries between training and inference, potentially creating new categories of compute that existing thresholds don’t address.
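
The ambiguity starts with measurement itself: reported training compute is typically estimated with the standard approximation of roughly 6 FLOP per parameter per training token. The sketch below applies it to illustrative (not real) model figures and shows how the fine-tuning question alone can decide whether a threshold fires.

```python
# Sketch: the common "6 * parameters * tokens" estimate of training FLOP, and how
# the fine-tuning ambiguity noted above can move a model across a threshold.
# All figures below are illustrative, not measurements of any real model.

def train_flop(params: float, tokens: float) -> float:
    return 6 * params * tokens  # standard dense-transformer approximation

pretrain = train_flop(params=400e9, tokens=4e12)    # ~9.6e24: just under 1e25
finetune = train_flop(params=400e9, tokens=0.3e12)  # ~7.2e23 of further training

print(f"pre-training only:      {pretrain:.2e} FLOP")
print(f"with fine-tuning stage: {pretrain + finetune:.2e} FLOP")
# 9.60e24 vs. 1.03e25 -- whether fine-tuning counts decides if the 1e25 trigger fires.
```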

Cloud computing platforms could provide third-party verification of compute usage, but this would require standardized reporting mechanisms and could expose competitively sensitive information about training methodologies and resource allocation strategies.

A particularly significant emerging challenge is the shift from training-time to inference-time compute scaling. Toby Ord’s GovAI research on inference scaling warns that “the shift from scaling up pre-training compute to inference compute may have profound effects on AI governance. Rapid scaling of inference-at-deployment could potentially undermine AI governance measures that rely on training-compute thresholds.”

OpenAI’s o1 and o3 models demonstrate that substantial capability improvements can come from inference-time computation rather than training compute. OpenAI demonstrated their o3 model using 10,000x as much compute as o1-mini at inference time. According to Lennart Heim’s analysis, a model trained with 10^24 FLOP could have its inference scaled up by 4 orders of magnitude and perform at the level of a model trained with 10^27 FLOP—completely bypassing current regulatory thresholds. This creates a fundamental gap in current threshold regimes:

| Compute Type | Current Coverage | Governance Challenge |
|---|---|---|
| Training compute | Covered by EU/US thresholds | Well-defined, measurable |
| Fine-tuning compute | Ambiguous coverage | May be excluded from calculations |
| Inference compute (deployment) | Not covered | Grows with usage, hard to predict |
| Test-time training | Not covered | Blurs training/inference boundary |

As inference-time scaling becomes more prevalent, a model trained with below-threshold compute could achieve above-threshold capabilities through extensive inference-time computation, completely evading current regulatory frameworks.
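
Heim's example implies a rough exchange rate: four orders of magnitude of extra inference compute buying about three orders of magnitude of training-equivalent capability. The sketch below encodes that assumed exchange rate as a back-of-envelope model; it is not an established scaling law.

```python
# Sketch: training-compute equivalence of inference scaling, per the Heim example above
# (1e24 training FLOP + 4 OOM of inference ~ a 1e27-FLOP model). The implied exchange
# rate of ~0.75 training OOM per inference OOM is a back-of-envelope assumption.

import math

EXCHANGE_RATE = 3 / 4  # training OOM gained per OOM of extra inference compute

def training_equivalent(train_flop: float, inference_multiplier: float) -> float:
    inference_oom = math.log10(inference_multiplier)
    return train_flop * 10 ** (EXCHANGE_RATE * inference_oom)

eq = training_equivalent(1e24, inference_multiplier=1e4)
print(f"1e24-FLOP model at 10,000x inference ~ {eq:.0e}-FLOP trained model")
# -> ~1e+27: above both the EU (1e25) and US (1e26) triggers, with no threshold crossed.
```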

Compute thresholds provide several valuable safety benefits despite their limitations. They create predictable regulatory entry points that allow companies to plan safety investments and compliance strategies in advance, rather than reacting to post-deployment requirements. The transparency requirements triggered by thresholds generate valuable information about frontier AI development that enables better risk assessment and policy development.

Threshold systems also establish precedents for AI-specific regulation that can evolve toward more sophisticated approaches over time. They provide regulatory agencies with initial experience governing AI development while building institutional capacity for more complex oversight mechanisms. The international coordination emerging around threshold harmonization creates foundations for broader AI governance cooperation.

From an industry perspective, thresholds provide regulatory certainty that enables long-term investment in safety infrastructure while creating level playing fields where all frontier developers face similar requirements.

However, compute thresholds exhibit significant safety limitations that could create false confidence in regulatory coverage. They may miss dangerous capabilities that emerge at lower compute levels, particularly in specialized domains like biotechnology or cybersecurity where domain-specific training data matters more than raw computational scale.

The static nature of current thresholds creates growing blind spots as algorithmic efficiency improves, potentially allowing increasingly capable systems to evade oversight. Threshold evasion strategies could enable bad actors to develop dangerous capabilities while avoiding regulatory scrutiny, particularly if enforcement mechanisms remain weak.

Perhaps most concerning, compute thresholds may distract from more direct capability-based assessments that could provide better safety coverage. The focus on computational inputs rather than capability outputs could lead to regulatory frameworks that miss the most important risk factors while imposing compliance burdens on relatively safe high-compute applications.

The immediate future will see operationalization of existing threshold regimes, with EU AI Act requirements becoming fully effective in August 2025 and US Executive Order provisions being codified into formal regulations. This period will provide crucial empirical data about threshold effectiveness, compliance costs, and gaming strategies that will inform future policy development.

According to GovAI forecasts on frontier model counts, the number of models exceeding absolute compute thresholds will increase superlinearly, while thresholds defined relative to the largest training run see a more stable trend of 14-16 models captured annually from 2025-2028. This suggests static absolute thresholds like the current EU and US implementations will capture an increasing number of models over time, potentially requiring significant regulatory scaling.

| Year | Models Exceeding 10^25 FLOP (Estimate) | Models Exceeding Relative Threshold | Regulatory Implication |
|---|---|---|---|
| 2024 | 5-10 | 14-16 | Current capacity adequate |
| 2025 | 15-25 | 14-16 | EU compliance begins |
| 2026 | 30-50 | 14-16 | May need threshold adjustment |
| 2027 | 60-100 | 14-16 | Scaling challenges |
| 2028 | 100-200 | 14-16 | Potential capacity crisis |
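
The absolute-threshold column above grows at roughly 2x per year, a rate inferred from the forecast figures themselves; a two-line extrapolation reproduces the pattern.

```python
# Sketch: the absolute-threshold column above grows roughly 2x/year; extrapolating
# from the 2024 midpoint reproduces the forecast's shape. The growth rate is inferred
# from the table, not an independent estimate.

count_2024 = 7.5  # midpoint of the 5-10 models estimated for 2024
for year in range(2024, 2029):
    print(f"{year}: ~{count_2024 * 2 ** (year - 2024):.0f} models over 1e25 FLOP "
          f"(relative threshold: ~15)")
# 2028: ~120 models -- inside the forecast's 100-200 range.
```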

International harmonization discussions are intensifying as the compliance burden of divergent threshold regimes becomes apparent to global AI developers. At the February 2025 AI Action Summit in Paris, the OECD and UK AI Safety Institute co-organized a session on “Thresholds for Frontier AI” featuring representatives from Google DeepMind, Meta, Anthropic, the Frontier Model Forum, and the EU AI Office. According to the OECD AI Policy Observatory, participants highlighted key challenges including that frontier AI systems are “by definition, novel and constantly evolving with limited data on past incidents” and that their general-purpose nature makes risk estimation difficult. Technical standards development will accelerate, particularly around compute measurement methodologies and verification mechanisms.

The medium-term trajectory will likely see significant evolution away from purely static thresholds toward more sophisticated triggering mechanisms. Algorithmic efficiency improvements will force either frequent threshold updates or adoption of alternative approaches that maintain regulatory relevance despite efficiency gains.

Capability-based triggers are expected to emerge as a complement to or replacement for compute thresholds, using standardized benchmark evaluations to determine regulatory requirements based on demonstrated abilities rather than resource consumption. GovAI research on risk thresholds recommends that “companies define risk thresholds to provide a principled foundation for their decision-making, use these to help set capability thresholds, and then primarily rely on capability thresholds.”

| Threshold Type | Advantages | Disadvantages | Best Use Case | Current Implementations |
|---|---|---|---|---|
| Compute-based (absolute) | Simple, measurable, predictable; can be verified externally | Becomes obsolete as efficiency doubles every 8-17 months | Initial screening, pre-training notification | EU AI Act (10^25), US EO 14110 (10^26) |
| Compute-based (relative) | Adapts to frontier advances; maintains stable model count | Requires ongoing calibration; definitional complexity | Capturing only true frontier models | Proposed in GovAI research |
| Capability-based | Directly measures risk-relevant properties | Hard to evaluate comprehensively; may miss novel capabilities | Post-training safety assessment | UK AISI evaluations, Anthropic/OpenAI internal frameworks |
| Risk-based | Most principled approach; directly addresses harms | Most difficult to evaluate reliably; requires causal understanding | Strategic decision frameworks | GovAI risk thresholds research |
| Hybrid (compute + capability) | Balances predictability with relevance | Complex to implement; higher compliance burden | Long-term regulatory evolution | EU AI Act Article 51(3) provisions for threshold updates |

International regime development will likely produce multilateral frameworks for threshold coordination, potentially through new international organizations or expanded mandates for existing bodies like the OECD or UN. These frameworks will need to address both threshold harmonization and enforcement cooperation to be effective.

The long-term future of compute thresholds depends critically on the pace of algorithmic efficiency improvements and the development of alternative governance mechanisms. If efficiency gains continue at current rates, compute-based triggers may become obsolete entirely, requiring wholesale transition to capability-based or other approaches.

Alternatively, threshold evolution could incorporate dynamic adjustment mechanisms that automatically update based on efficiency benchmarks or capability correlations, maintaining relevance despite technological change. This would require sophisticated measurement systems and potentially automated regulatory frameworks.
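
Such a mechanism could be as simple as indexing the FLOP trigger to a measured efficiency benchmark; the sketch below halves the trigger in step with an observed capability-per-FLOP doubling time. Both the update rule and the 12-month doubling figure are illustrative assumptions.

```python
# Sketch: a threshold that auto-adjusts with measured algorithmic efficiency, one
# version of the dynamic mechanism described above. The update rule and the example
# doubling time are illustrative assumptions, not a proposed regulation.

def indexed_threshold(base_flop: float, years_elapsed: float,
                      measured_doubling_years: float) -> float:
    """Lower the FLOP trigger in step with observed capability-per-FLOP doubling."""
    return base_flop / 2 ** (years_elapsed / measured_doubling_years)

# EU-style 1e25 base threshold, re-indexed annually against a 12-month doubling estimate:
for year in range(2024, 2029):
    t = indexed_threshold(1e25, years_elapsed=year - 2024, measured_doubling_years=1.0)
    print(f"{year}: trigger at {t:.1e} FLOP")
# The trigger keeps tracking the same capability level instead of a fixed FLOP count.
```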

The emergence of novel AI architectures like neuromorphic computing or quantum-classical hybrid systems could fundamentally alter the compute-capability relationship, potentially making current FLOP-based measurements irrelevant and requiring entirely new regulatory metrics.

Several critical uncertainties will determine the future effectiveness of compute threshold approaches. The pace and trajectory of algorithmic efficiency improvements remains unpredictable, with potential for breakthrough innovations that dramatically decouple compute from capabilities. Current trend extrapolation suggests 2x annual improvements, but this could accelerate or plateau depending on fundamental algorithmic advances.

The correlation between compute and dangerous capabilities is empirically understudied, particularly for specialized risks like bioweapons development or deceptive alignment. Better understanding these relationships is crucial for calibrating threshold levels and determining when capability-based triggers might be more appropriate.

Enforcement mechanisms remain largely theoretical, with limited real-world testing of verification systems or consequences for non-compliance. The willingness and ability of regulatory agencies to detect and respond to threshold evasion will ultimately determine system effectiveness.

International coordination dynamics are highly uncertain, particularly regarding participation by major AI powers like China and cooperation between democratic and authoritarian governance systems. The success of threshold regimes may depend critically on achieving sufficient global coverage to prevent regulatory arbitrage.

The development of standardized capability evaluation systems presents both technical and political challenges that could determine whether hybrid threshold-capability approaches become feasible. Progress on evaluation methodology, benchmark development, and international standards will shape the evolution of regulatory frameworks beyond pure compute triggers.


The following research organizations have produced foundational work on compute threshold governance:

| Organization | Key Contribution | Focus Area |
|---|---|---|
| GovAI | Training Compute Thresholds, Inference Scaling Governance, Risk Thresholds | Threshold design, alternative approaches |
| CSET Georgetown | AI Governance at the Frontier, preparedness frameworks | Policy implementation, US context |
| Epoch AI | Compute trends, training cost analysis | Empirical compute data, forecasting |
| UK AI Security Institute | Frontier AI Trends Report, capability evaluations | Empirical capability assessment |
| OECD | Thresholds for Frontier AI sessions | International coordination, standards |

  • Export Controls — Restricting access rather than triggering requirements
  • Compute Monitoring — Ongoing visibility into training
  • International Regimes — Multilateral threshold coordination


Compute thresholds improve the AI Transition Model through Civilizational Competence:

| Factor | Parameter | Impact |
|---|---|---|
| Civilizational Competence | Regulatory Capacity | Objective triggers enable automated enforcement of safety requirements |
| Civilizational Competence | Institutional Quality | Clear thresholds reduce regulatory discretion and political capture |

Threshold effectiveness depends on keeping pace with algorithmic efficiency improvements; static thresholds become obsolete within 3-5 years.