Capability-Alignment Race Model
Quantifies the capability-alignment race: capabilities are currently ~3 years ahead of alignment readiness, with the gap widening at 0.5 years/year, driven by 10²⁶ FLOP training runs scaling against ~15% interpretability coverage and ~30% scalable-oversight maturity. Projects the gap reaching 5-7 years by 2030 unless annual alignment research funding rises from $200M to $800M, with a 60% chance of a warning shot before TAI that could trigger a governance response.
Overview
The Capability-Alignment Race Model quantifies the fundamental dynamic determining AI safety: the gap between advancing capabilities and our readiness to safely deploy them. Current analysis shows capabilities ~3 years ahead of alignment readiness, with this gap widening at 0.5 years annually.
The model tracks how frontier compute (currently 10²⁶ FLOP for the largest training runs) and algorithmic improvements drive capability progress at ~10-15 percentage points per year, while alignment research advances more slowly: interpretability covers ~15% of model behavior (and less than 5% of frontier model computations are mechanistically understood), and scalable oversight sits at ~30% maturity. This creates deployment pressure worth $100B annually, racing against governance systems operating at ~25% effectiveness.
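The headline dynamic can be stated compactly. The sketch below encodes the linear gap model implied by the numbers above; the linear form and the parameter values are simplifying assumptions drawn from this page's figures, not the model's full specification.

```python
# Minimal sketch of the gap dynamic described above: a ~3-year gap
# widening at ~0.5 years per year. Linear extrapolation is an
# illustrative assumption, not the full model.

def capability_alignment_gap(years_ahead: float,
                             initial_gap: float = 3.0,      # years, current estimate
                             widening_rate: float = 0.5) -> float:  # years per year
    """Projected capability-alignment gap, in years."""
    return initial_gap + widening_rate * years_ahead

# Five years out (~2030): 3.0 + 0.5 * 5 = 5.5 years, consistent with
# the 5-7 year range projected for 2030 absent intervention.
print(capability_alignment_gap(5.0))  # 5.5
```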
Risk Assessment
The following table synthesizes key risk factors shaping the capability-alignment race. Estimates reflect the interaction of several dynamics: the pace at which training compute is scaling (roughly 4–5× per year)1, the brittleness of current safety mitigations2, and strategic competitive pressures that incentivize deployment before risks are minimized.3 These factors compound: faster scaling shortens the window for alignment work, while competitive dynamics reduce willingness to pause.4 Probability estimates are illustrative rather than actuarial, intended to convey relative plausibility.
| Factor | Severity | Likelihood | Timeline | Trend |
|---|---|---|---|---|
| Gap widens to 5+ years | Catastrophic | 50% | 2027-2030 | Accelerating |
| Alignment breakthroughs | Critical (positive) | 20% | 2025-2027 | Uncertain |
| Governance catches up | High (positive) | 25% | 2026-2028 | Slow |
| Warning shots trigger response | Medium (positive) | 60% | 2025-2027 | Increasing |
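The per-year and multi-year probabilities used throughout this page are related by simple hazard compounding. The sketch below shows the arithmetic under an independence assumption; the specific annual rates are taken from the 5-year projection table later on this page, and the shortfall relative to 60% suggests the warning-shot estimate also prices in correlated drivers or higher hazards.

```python
# How annual incident probabilities compound into a multi-year
# likelihood. Annual rates follow the 5-year projection table below
# (15%/yr rising toward 20%/yr); treating years as independent is a
# simplifying assumption.

def cumulative_probability(annual_rates: list[float]) -> float:
    """P(at least one event) given independent per-year probabilities."""
    p_no_event = 1.0
    for rate in annual_rates:
        p_no_event *= 1.0 - rate
    return 1.0 - p_no_event

# Three years at escalating hazard rates:
print(round(cumulative_probability([0.15, 0.18, 0.20]), 2))  # ~0.44
# Independence alone gives ~44%, short of the table's 60% figure.
```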
Key Dynamics & Evidence
Capability Acceleration
| Component | Current State | Growth Rate | 2027 Projection | Source |
|---|---|---|---|---|
| Training compute | 10²⁶ FLOP | 4x/year | 10²⁸ FLOP | Epoch AI |
| Algorithmic efficiency | 2x 2024 baseline | 1.5x/year | 3.4x baseline | Erdil & Besiroglu (2023) |
| Performance (MMLU) | 89% | +8pp/year | >95% | Anthropic |
| Frontier lab lead | 6 months | Stable | 3-6 months | RAND |
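These projections are straightforward compound growth. A quick check, assuming a 2024 baseline and a three-year horizon inferred from the projection column:

```python
# Compound-growth check on the table above. The 4x/year compute figure
# is Epoch AI's; the 2024 -> 2027 three-year horizon is inferred.

def project(current: float, annual_multiplier: float, years: int) -> float:
    """Compound a quantity forward by a fixed annual growth factor."""
    return current * annual_multiplier ** years

print(f"{project(1e26, 4.0, 3):.1e} FLOP")  # 6.4e+27, i.e. roughly 10^28
print(f"{1.5 ** 3:.2f}x")  # 3.38x: the three-year algorithmic-efficiency multiplier (~3.4x)
```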
Alignment Lag
| Component | Current Coverage | Improvement Rate | 2027 Projection | Critical Gap |
|---|---|---|---|---|
| Interpretability (behavior coverage) | 15% | +5pp/year | 30% | Need 80% for safety |
| Scalable oversight | 30% | +8pp/year | 54% | Need 90% for superhuman |
| Deceptive alignment detection | 20% | +3pp/year | 29% | Need 95% for AGI |
| Alignment tax | 15% loss | -2pp/year | 9% loss | Target <5% for adoption |
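At these linear improvement rates, the critical-gap column implies target dates well beyond 2027. A sketch, assuming a 2024 baseline and straight-line extrapolation (both simplifications):

```python
# When does linear improvement reach each safety threshold? Rates and
# targets come from the table; the 2024 baseline and straight-line
# extrapolation are simplifying assumptions.

def year_target_reached(current_pct: float, rate_pp_per_year: float,
                        target_pct: float, base_year: int = 2024) -> float:
    """Year at which linear improvement hits the target coverage."""
    if rate_pp_per_year <= 0:
        raise ValueError("improvement rate must be positive")
    return base_year + (target_pct - current_pct) / rate_pp_per_year

print(year_target_reached(15, 5, 80))  # 2037.0 -- interpretability at 80%
print(year_target_reached(30, 8, 90))  # 2031.5 -- scalable oversight at 90%
```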
Deployment Pressure
Economic value drives rapid deployment, creating misalignment between safety needs and market incentives.
| Pressure Source | Current Impact | Annual Growth | 2027 Impact | Mitigation |
|---|---|---|---|---|
| Economic value | $500B/year | 40% | $1.5T/year | Regulation, liability |
| Military competition | 0.6/1.0 intensity | Increasing | 0.8/1.0 | Arms control treaties |
| Lab competition | 6 month lead | Shortening | 3 month lead | Industry coordination |
Quote from Paul Christiano (Alignment Forum): "The core challenge is that capabilities are advancing faster than our ability to align them. If this gap continues to widen, we'll be in serious trouble."
Current State & Trajectory
2025 Snapshot
The race is in a critical phase, with capabilities advancing faster than alignment solutions. Training compute for frontier models has grown at approximately 4–5x per year since 2010, a pace that outstrips nearly every comparable technology adoption curve in history.1 AI performance on demanding benchmarks such as GPQA and SWE-bench improved by roughly 49 and 67 percentage points respectively between 2023 and 2024 alone.5 Meanwhile, AISI has noted that the gap between AI capabilities and safety mitigations is growing fast across multiple risk categories, and current control methods do not reliably scale past certain capability levels without further scientific progress.2
Key dimensions of the current situation:
- Frontier models approaching human-level performance on many expert benchmarks
- Alignment research still in early stages with limited coverage of capability space
- Governance systems lagging significantly behind technical progress
- Economic incentives strongly favor rapid deployment over safety
Self-replication evaluation success rates among frontier systems increased from 5% in 2023 to 60% in 2025, illustrating how rapidly dangerous capability thresholds are being crossed.6
5-Year Projections
| Metric | Current | 2027 | 2030 | Risk Level |
|---|---|---|---|---|
| Capability-alignment gap | 3 years | 4-5 years | 5-7 years | Critical |
| Deployment pressure | 0.7/1.0 | 0.85/1.0 | 0.9/1.0 | High |
| Governance strength | 0.25/1.0 | 0.4/1.0 | 0.6/1.0 | Improving |
| Warning shot probability | 15%/year | 20%/year | 25%/year | Increasing |
Based on Metaculus forecasts and expert surveys from AI Impacts.
These projections rest on several key assumptions. First, they assume compute scaling continues at roughly its current rate; Epoch AI projects that frontier training runs will hit a practical economic ceiling of about nine months' duration around 2027, after which compute growth must come from hardware scaling or distributed clusters.7 Second, they assume algorithmic efficiency gains continue (the compute required to reach a given capability level is declining roughly 3x annually), which means effective capability could grow far faster than raw compute figures suggest.8 Third, they assume no sudden international coordination. By the late 2020s or early 2030s, effective compute accessible to leading labs could reach roughly one million times GPT-4's training compute once algorithmic progress is factored in.9 Any of these assumptions shifting substantially would alter the trajectory.
Open-weight models complicate the picture further: they lag state-of-the-art by only approximately three months on average, meaning safeguards on hosted systems cannot be assumed to contain capability diffusion.10
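The interaction of the first two assumptions is worth making explicit: raw compute growth and algorithmic efficiency multiply. A back-of-envelope check follows; the multiplicative combination is a common simplification, not necessarily how the cited sources model it.

```python
import math

# Effective compute = raw compute growth x algorithmic efficiency gain.
# 4.5x/yr and 3x/yr are the rates cited above; multiplying them is a
# simplifying assumption about how the two interact.

RAW_GROWTH = 4.5    # x/year, frontier training compute
ALGO_GROWTH = 3.0   # x/year, capability-equivalent gain from algorithms

effective_growth = RAW_GROWTH * ALGO_GROWTH  # ~13.5x/year
years_to_million_x = math.log(1e6) / math.log(effective_growth)
print(f"{years_to_million_x:.1f} years")  # ~5.3 years from a GPT-4-era
# baseline, landing ~10^6x effective compute in the late 2020s,
# consistent with the text above.
```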
Potential Turning Points
Critical junctures that could alter trajectories:
- Major alignment breakthrough (20% chance by 2027): Interpretability or oversight advance that halves the gap. The Alignment Project's £27 million international fund, supported by governments, industry, and philanthropy, is specifically targeting interpretability and oversight mechanisms.11 Progress here could compress the gap meaningfully, but Anthropic notes that no one currently knows how to train very powerful AI systems to be robustly helpful and harmless.4
- Capability plateau (15% chance): Scaling laws break down, slowing capability progress. Epoch AI identifies four plausible constraints (power availability, chip manufacturing capacity, data scarcity, and latency walls), any of which could arrest growth before 2030.12
- Coordinated pause (10% chance): International agreement to pause frontier development. GovAI research on strategic dynamics finds that technology laggards willing to cut corners gamble for advantage when they are close in capability to leaders, making coordination fragile.3
- Warning shot incident (60% chance by 2027): Serious but recoverable AI accident that triggers policy response. Anthropic has noted that rapid AI progress may trigger competitive races leading corporations or nations to deploy untrustworthy AI systems with catastrophic results.4
Key Uncertainties & Research Cruxes
Technical Uncertainties
| Question | Current Evidence | Expert Consensus | Implications |
|---|---|---|---|
| Can interpretability scale to frontier models? | Limited success on smaller models | 45% optimistic | Determines alignment feasibility |
| Will scaling laws continue? | Training compute grows ≈4–5× per year1 | 70% continue to 2027 | Core driver of capability timeline |
| How much alignment tax is acceptable? | Safety-focused firms spend 30–40% of dev cycles on alignment13 | Target <5% overhead | Adoption vs. safety tradeoff |
Current AI control methods do not reliably scale past certain capability levels without further scientific progress.2 Empirical benchmarks reveal trade-offs between safety and utility, and no monotonic "bigger is safer" trend exists—high-parameter models remain vulnerable under certain attacks regardless of scale.9
Governance Questions
- Regulatory capture: Will AI labs co-opt government oversight? CNAS analysis suggests 40% risk. Technology laggards willing to cut corners can gamble for advantage when highly adversarial and capability-close to leaders.3
- International coordination: Can major powers cooperate on AI safety? The gap between AI capabilities and mitigations is growing fast across multiple risk categories, leaving little slack for slow-moving diplomacy.2
- Democratic response: Will public concern drive effective policy? Polling shows growing awareness (Pew Research Center) but uncertain translation into action.
Strategic Cruxes
Core disagreements among experts on alignment difficulty reflect genuinely different empirical predictions. If interpretability does scale, the technical-optimist position strengthens considerably; if not, coordination or pause strategies become relatively more attractive. Resolution of the alignment-tax question directly affects competitive dynamics: alignment techniques that also improve capabilities, as RLHF has done, reduce the race pressure the model predicts.14 Meanwhile, rapid benchmark improvements (GPQA up 48.9 percentage points between 2023 and 2024)5 suggest capability timelines may compress faster than any of the positions below assume.
- Technical optimism: 35% believe alignment will prove tractable
- Governance solution: 25% think coordination/pause is the path forward
- Warning shots help: 60% expect helpful wake-up calls before catastrophe
- Timeline matters: 80% agree slower development improves outcomes
Timeline of Critical Events
The table below synthesizes projected milestones across capability development, alignment research, and governance. These projections carry substantial uncertainty: training compute for frontier models has grown approximately 4–5× per year1, and algorithmic efficiency improvements mean effective compute could reach roughly one million times GPT-4's training compute by the early 2030s.9 Benchmark performance in some domains has been doubling every eight months.6 Dates should be treated as rough central estimates rather than firm predictions. Metaculus community forecasts and expert surveys inform the structure of this timeline, though specific outcomes remain deeply contested.
| Period | Capability Milestones | Alignment Progress | Governance Developments |
|---|---|---|---|
| 2025 | GPT-5 level, 80% human tasks | Basic interpretability tools | EU AI Act implementation |
| 2026 | Multimodal AGI claims | Scalable oversight demos | US federal AI legislation |
| 2027 | Superhuman in most domains | Alignment tax <10% | International AI treaty |
| 2028 | Recursive self-improvement | Deception detection tools | Compute governance regime |
| 2030 | Transformative AI deployment | Mature alignment stack | Global coordination framework |
Based on Metaculus community predictions and Future of Humanity Institute surveys.
Resource Requirements & Strategic Investments
Understanding the scale of alignment investment requires context against total AI industry spending. U.S. private AI investment reached $109.1 billion in 2024, nearly 12 times China's $9.3 billion.5 Against this backdrop, dedicated alignment funding remains a small fraction of total spending. The UK AI Security Institute's Alignment Project had assembled over £27 million in total alignment research funding as of February 2026, with OpenAI contributing £5.6 million of that total.11 Safety-focused companies like Anthropic and OpenAI report spending 30–40% of development cycles on alignment and safety features.13
Priority Funding Areas
Analysis suggests optimal resource allocation to narrow the gap:
| Investment Area | Current Funding | Recommended | Gap Reduction | ROI |
|---|---|---|---|---|
| Alignment research | $200M/year | $800M/year | 0.8 years | High |
| Interpretability | $50M/year | $300M/year | 0.3 years | Very high |
| Governance capacity | $100M/year | $400M/year | Indirect (time) | Medium |
| Coordination/pause | $30M/year | $200M/year | Variable | High if successful |
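One way to read this table is marginal cost-effectiveness: gap-years closed per additional dollar. A rough sketch follows; the linear "gap-years per $100M" metric is an assumed simplification for comparison, not part of the source model.

```python
# Marginal cost-effectiveness of the two areas with quantified gap
# reductions. Figures come from the table; linearity is assumed.

areas = {
    # area: (current $M/yr, recommended $M/yr, gap reduction in years)
    "alignment research": (200, 800, 0.8),
    "interpretability":   (50, 300, 0.3),
}

for name, (current, recommended, gap_years) in areas.items():
    marginal = recommended - current            # $M/yr of new funding
    per_100m = gap_years / (marginal / 100.0)   # gap-years per $100M/yr
    print(f"{name}: {per_100m:.2f} gap-years per $100M/yr")

# alignment research: 0.13 gap-years per $100M/yr
# interpretability:   0.12 gap-years per $100M/yr -- similar at the
# margin despite the table's higher ROI label, under this linear reading.
```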
Key Organizations & Initiatives
Leading efforts to address the capability-alignment gap:
| Organization | Focus | Annual Budget | Approach |
|---|---|---|---|
| Anthropic | Constitutional AI | $500M | Constitutional training |
| DeepMind | Alignment team | $100M | Scalable oversight |
| MIRI | Agent foundations | $15M | Theoretical foundations |
| ARC | Alignment research | $20M | Empirical alignment |
For historical context, total AI safety spending was estimated at roughly $9 million in 2017 and approximately $40 million globally in 2019, with average annual funding increases of about $13 million per year between 2014 and 2024.15 The Alignment Project's first funding round received over 800 applications from 466 institutions across 42 countries, signaling rapidly growing researcher interest relative to available funding.11
Related Models & Cross-References
This model connects to several other risk analyses:
- Racing Dynamics: How competition accelerates capability development
- Multipolar Trap: Coordination failures in competitive environments
- Warning Signs: Indicators of dangerous capability-alignment gaps
- Takeoff dynamics: Speed of AI development and adaptation time
The model also informs key debates:
- Pause vs. Proceed: Whether to slow capability development
- Open vs. Closed: Model release policies and proliferation speed
- Regulation Approaches: Government responses to the race dynamic
Sources & Resources
Academic Papers & Research
| Study | Key Finding | Citation |
|---|---|---|
| Scaling Laws | Compute-capability relationship | Kaplan et al. (2020) |
| Alignment Tax Analysis | Safety overhead quantification | Kenton et al. (2021) |
| Governance Lag Study | Policy adaptation timelines | [D |
Footnotes
1. Epoch AI, "Training compute of frontier AI models grows by 4-5x per year" (https://epoch.ai/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year)
2. AISI, "How we're addressing the gap between AI capabilities and mitigations" (https://aisi.gov.uk/blog/aisis-research-direction-for-technical-solutions)
3. GovAI, "Safety Not Guaranteed: International Strategic Dynamics of Risky Technology Races" (https://governance.ai/research-paper/safety-not-guaranteed-international-strategic-dynamics-of-risky-technology-races)
4. Anthropic, "Core Views on AI Safety" (https://anthropic.com/news/core-views-on-ai-safety)
5. Stanford HAI, "Artificial Intelligence Index Report 2025" (https://hai-production.s3.amazonaws.com/files/hai_ai_index_report_2025.pdf)
6. AISI, "Frontier AI Trends Report" (https://aisi.gov.uk/frontier-ai-trends-report)
7. Epoch AI, "Frontier LLM training runs can't get much longer" (https://epoch.ai/data-insights/longest-training-run)
8. arXiv, "Compute Requirements for Algorithmic Innovation in Frontier AI Models" (https://arxiv.org/pdf/2507.10618)
9. CNAS, "Future-Proofing Frontier AI Regulation" (https://cnas.org/publications/reports/future-proofing-frontier-ai-regulation)
10. Epoch AI Brief, October 2025 (https://epochai.substack.com/p/the-epoch-ai-brief-october-2025)
11. AISI, "Funding 60 projects to advance AI alignment research" (https://aisi.gov.uk/blog/funding-60-projects-to-advance-ai-alignment-research)
12. Epoch AI, "Can AI Scaling Continue Through 2030?" (https://epoch.ai/blog/can-ai-scaling-continue-through-2030)
13. "The AI Alignment Tax: Understanding the Cost of Safety in AI Capability Development" (https://getmonetizely.com/articles/the-ai-alignment-tax-understanding-the-cost-of-safety-in-ai-capability-development)
14. LessWrong, "Alignment can be the 'clean energy' of AI" (https://lesswrong.com/posts/irxuoCTKdufEdskSk/alignment-can-be-the-clean-energy-of-ai)
15. LessWrong, "An Overview of the AI Safety Funding Situation" (https://lesswrong.com/posts/WGpFFJo2uFe5ssgEb/an-overview-of-the-ai-safety-funding-situation)