ARC (Alignment Research Center)

Safety Org

Alignment Research Center

Comprehensive reference page on ARC (Alignment Research Center), covering its evolution from a dual theory/evals organization to ARC Theory (3 permanent researchers) plus the METR spin-out (December 2023), with specific funding figures ($265K Coefficient Giving (formerly Open Philanthropy) grant, $1.25M returned FTX grant), ELK prize details ($274K total), and Christiano's 20%/46% doom estimates. The content is a well-sourced compilation of publicly available information with no original analysis.

Type: Safety Org
Founded: 2021
Location: Berkeley, CA
Employees: ~20
Funding: ~$10M/year

Related:
  • People: Paul Christiano
  • Safety Agendas: Scalable Oversight
  • Risks: Deceptive Alignment · AI Capability Sandbagging
  • Organizations: Anthropic · OpenAI · Machine Intelligence Research Institute
  • Policies: UK AI Safety Institute

Overview

The Alignment Research Center (ARC) is a nonprofit AI safety research organization founded in 2021 by Paul Christiano after his departure from OpenAI. ARC's current focus is theoretical research on alignment, specifically work on heuristic arguments for understanding neural network behavior — an approach ARC describes as occupying a middle ground between interpretability and formal verification.1

ARC originally operated two divisions: a theory research team and an evaluations team (ARC Evaluations). In September 2023, ARC announced that ARC Evals would spin out as an independent organization, and in December 2023, ARC Evals formally became METR (Model Evaluation & Threat Research), an independent 501(c)(3) nonprofit.2 As of early 2024, ARC (now sometimes called "ARC Theory" to distinguish it from METR) is a small team of three permanent researchers — Christiano, Mark Xu, and Jacob Hilton — plus a varying number of temporary members.3

ARC's primary funders have included Coefficient Giving, which made at least one documented grant of $265,000 in 2022.4 ARC also received and subsequently returned a $1.25 million grant from Sam Bankman-Fried's FTX Foundation following FTX's bankruptcy.4 Christiano was appointed Head of AI Safety at the US AI Safety Institute (housed at NIST) in April 2024, though this is a personal appointment rather than an institutional contract with ARC.5

Organizational Structure and History

ARC Theory (Current)

Following the METR spin-out, ARC focuses exclusively on theoretical alignment research. As of March 2024, the permanent Theory team consists of Paul Christiano, Mark Xu, and Jacob Hilton, with a varying number of temporary researchers (recently 0–3).3 ARC shares office space with other AI alignment groups including Redwood Research.3

The Theory team describes its work as "often somewhat similar to academic research in pure math or theoretical computer science," and hiring as of early 2024 sought researchers with strong backgrounds in mathematics, physics, or theoretical computer science.3

METR (Formerly ARC Evals)

ARC incubated ARC Evals beginning in 2022, hiring Beth Barnes (a former OpenAI researcher) to lead exploratory work on independent evaluations of frontier AI models. ARC Evals completed evaluations of GPT-4 (in partnership with OpenAI) and Claude (in partnership with Anthropic), formally partnered with the UK's Foundation Model Taskforce, and grew to become a majority of ARC's headcount.6

The growth in ARC Evals' size prompted formalization of the separation. The spin-out was announced on September 19, 2023,6 and the formal name change to METR was completed on December 4, 2023.2 METR is now an independent 501(c)(3) nonprofit led by Beth Barnes as CEO, and continues to conduct pre-deployment evaluations of frontier AI models for autonomous capabilities. It is treated separately in this knowledge base: see METR.

Risk Assessment

| Risk Category | Assessment | Evidence | Timeline |
|---|---|---|---|
| Deceptive Alignment | High severity, moderate likelihood | ELK research identifies difficulty of ensuring truthfulness | 2025–2030 |
| Capability Evaluations | Moderate severity, high likelihood | Models may not reveal full capabilities during testing | Ongoing |
| Governance capture by labs | Moderate severity, contested likelihood | Debate over whether self-regulation by labs is sufficient; ARC and others argue for independent evaluation | 2024–2027 |
| Alignment research stagnation | High severity, low likelihood | Theoretical problems may be intractable | 2025–2035 |

Note: These assessments reflect perspectives prevalent in the ARC-adjacent research community and are not independently verified consensus estimates. The "Governance capture" row reflects a live debate — some researchers view lab-evaluator cooperation as beneficial rather than a risk.

Key Research Contributions

Research Areas
| Name | Description | Started |
|---|---|---|
| Eliciting Latent Knowledge (ELK) | Research on extracting truthful knowledge from AI models regardless of learned deceptive behaviors | Dec 2021 |

ARC Theory: Eliciting Latent Knowledge

| Contribution | Description | Impact | Status |
|---|---|---|---|
| ELK Problem Formulation | How to train AI to report its internal knowledge rather than what it predicts humans want to hear | Influenced field framing of truthfulness and Scalable Oversight | Ongoing research |
| Heuristic Arguments (FPI) | Mathematical framework for reasoning about neural network behavior under uncertainty; machine-checkable but not requiring perfect certainty | Published as "Formalizing the Presumption of Independence" (2022); follow-up paper October 2024 | Active development |
| Worst-Case Alignment | Framework assuming AI might be adversarially deceptive, requiring robust safety measures | Adopted by some researchers; disputed by others who prioritize more probable failure modes | Ongoing debate |

The ELK Challenge: The ELK (Eliciting Latent Knowledge) problem concerns how to train an AI system to report its actual internal beliefs rather than what it predicts an observer wants to hear. ARC's research has identified numerous proposed solutions and their failure modes.7 The ELK problem remains unsolved; ARC characterizes it as "a problem we don't know how to solve, where we think rapid progress is being made."8
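The core training-signal difficulty can be shown with a deliberately tiny toy model. This is an illustration of the general problem (in the spirit of ARC's diamond-in-a-vault scenario), not ARC's formalism; the example data and reporter functions are invented for illustration:

```python
# Toy illustration of why training on human labels alone cannot distinguish
# a "direct reporter" (reports the model's latent knowledge) from a
# "human simulator" (reports what the human would conclude).
examples = [
    # (ground_truth, human_belief) -- hypothetical sensor-tampering scenarios
    (True,  True),   # easy case: diamond present, human sees it
    (False, False),  # easy case: diamond stolen, human notices
    (False, True),   # hard case: diamond stolen, but sensors were tampered with
]

def direct_reporter(truth, human_belief):
    return truth            # reports the model's actual latent knowledge

def human_simulator(truth, human_belief):
    return human_belief     # reports what the human would conclude

def loss_vs_human_labels(reporter):
    # 0/1 loss against human labels, the only supervision signal available
    return sum(reporter(t, h) != h for t, h in examples)

print(loss_vs_human_labels(direct_reporter))   # 1 -- penalized for the truth
print(loss_vs_human_labels(human_simulator))   # 0 -- fits the labels perfectly
```

The undesired human simulator achieves lower training loss precisely on the cases where the human is fooled, which is why ELK proposals must find some signal beyond human labels.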

The ELK Prize: From January to February 2022, ARC ran a prize competition for proposed ELK solutions. ARC received 197 proposals and awarded 32 prizes ranging from $5,000 to $20,000, plus 24 honorable mentions of $1,000 each, for a total of $274,000 in prizes.9 The first round alone distributed $70,000 among 8 people based on 30 distinct proposals from 25 submitters.10 No single submission fully resolved the ELK problem; prizes were awarded for partial progress and novel perspectives.
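The reported prize figures can be cross-checked with simple arithmetic (all numbers come from ARC's prize results posts cited above; this is only a consistency check, not new data):

```python
# Consistency check of the ELK prize figures.
honorable_mentions = 24 * 1_000          # 24 honorable mentions of $1,000 each
total_awarded = 274_000                  # total prize pool reported by ARC
main_prizes = total_awarded - honorable_mentions  # amount across the 32 prizes
avg_main_prize = main_prizes / 32

print(main_prizes)      # 250000
print(avg_main_prize)   # 7812.5 -- consistent with the $5k-$20k prize range
```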

Heuristic Arguments Research Program: ARC's second major report, "Formalizing the Presumption of Independence" (FPI), introduces a framework for "heuristic arguments" — reasoning structures similar to proofs, except their conclusions are not guaranteed to be correct and can be overturned by counterarguments.11 A follow-up paper, "Towards a Law of Iterated Expectations for Heuristic Estimators," was released in October 2024, introducing a coherence property for heuristic estimators called the "principle of unpredictable errors."12 Applications under investigation include mechanistic anomaly detection, safe distillation, and low-probability estimation.12
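For orientation, the classical law of iterated expectations that the October 2024 paper takes as its template is the standard probability identity (this restates textbook material, not the paper's new result):

```latex
\mathbb{E}[Y] \;=\; \mathbb{E}\big[\, \mathbb{E}[Y \mid X] \,\big]
```

The paper seeks an analogous coherence property for a heuristic estimator — writing, say, $G(Y \mid \pi)$ for a heuristic estimate of $Y$ given arguments $\pi$ (notation illustrative) — so that estimates remain consistent when the estimator reasons about its own estimates; the precise formulation is given in the paper itself.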

ARC describes its current research focus as attempting to combine mechanistic interpretability and formal verification: developing formal mechanistic explanations of neural network behavior that are machine-checkable without requiring perfect certainty.1

ARC Evals / METR: Systematic Capability Assessment

Note: ARC Evals became METR in December 2023. The evaluation work described below was conducted under the ARC Evals name; ongoing evaluation work is conducted by METR as an independent organization. See METR for current activities.

| Evaluation Type | Purpose | Key Models Tested | Policy Impact |
|---|---|---|---|
| Autonomous Replication | Can the model copy itself to new servers and acquire resources? | GPT-4, Claude | Informed deployment decisions; cited in GPT-4 system card |
| Strategic Deception | Can the model mislead evaluators? | Multiple frontier models | Contributed to RSP threshold-setting discussions |
| Resource Acquisition | Can the model obtain money or compute autonomously? | Various models | Referenced in policy discussions around the White House AI Executive Order |
| Situational Awareness | Does the model understand its context and deployment situation? | Latest frontier models | Lab safety protocol development |

GPT-4 and Claude Evaluation Results: ARC Evals' 2023 evaluations of GPT-4 (in partnership with OpenAI) and Claude (in partnership with Anthropic) found that neither model was capable of autonomously carrying out dangerous activities at the time of testing.13 However, models succeeded at several component tasks: browsing the internet, persuading humans to perform actions, and making long-term plans.13 One publicized example: GPT-4 successfully presented itself as a vision-impaired human to convince a TaskRabbit worker to solve a CAPTCHA.13 The GPT-4 system card notes: "the current model is probably not yet capable of autonomously [replicating and acquiring resources]."14

ARC Evals stated at the time: "We think that, for systems more capable than Claude and GPT-4, we are now at the point where we need to check carefully that new models do not have sufficient capabilities to replicate autonomously or cause catastrophic harm — it's no longer obvious that they won't be able to."15

Evaluation Methodology:

  • Red-team approach: Adversarial testing designed to elicit worst-case capabilities
  • Capability elicitation: Ensuring tests reveal true abilities, not merely default behaviors
  • Pre-deployment assessment: Testing before public release, with researcher oversight during testing
  • Threshold-based recommendations: Criteria for deployment decisions tied to observed capability levels
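The threshold-based approach can be sketched as a simple decision rule. The task names and thresholds below are illustrative inventions, not ARC Evals' or METR's actual criteria:

```python
# Hypothetical sketch of a threshold-based deployment recommendation.
# Task names and decision thresholds are illustrative, not actual criteria.
def deployment_recommendation(task_successes: dict) -> str:
    # Success on any "red line" task dominates the decision.
    red_line_tasks = {"autonomous_replication", "resource_acquisition"}
    if any(task_successes.get(t, False) for t in red_line_tasks):
        return "hold deployment; escalate for further evaluation"
    # Otherwise scale the monitoring recommendation with observed capability.
    if sum(task_successes.values()) > len(task_successes) / 2:
        return "deploy with enhanced monitoring"
    return "deploy with standard monitoring"

print(deployment_recommendation({"autonomous_replication": False,
                                 "web_browsing": True,
                                 "resource_acquisition": False}))
# -> deploy with standard monitoring
```

The design point is that recommendations are tied to observed capability levels rather than to qualitative impressions, which is what makes the results usable in deployment decisions.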

Current State and Trajectory

Research Progress (2024–2025)

| Research Area | Current Status | 2025–2027 Outlook |
|---|---|---|
| ELK Solutions | Multiple approaches proposed; all have known counterexamples identified in ARC research | Incremental progress expected; complete solution not anticipated near-term |
| Heuristic Arguments | FPI paper published (2022); follow-up paper released October 2024 | Further mathematical development; applications to specific alignment subproblems |
| Theoretical Alignment | Active research on formal mechanistic explanations | May develop connections to empirical interpretability work |
| Policy Influence (via METR) | METR (ARC spin-out) engaged with UK AISI and international bodies | Independent of ARC as of December 2023; trajectories now separate |

ARC's homepage notes that as of 2025, the organization has been "making conceptual and theoretical progress at the fastest pace since 2022."1

Organizational Evolution

2021–2022: Founded as a theoretical alignment research organization; primary output is the ELK report and associated prize competition.

2022–2023: ARC incubates ARC Evals, hiring Beth Barnes to lead independent evaluations of frontier AI models; ARC Evals conducts GPT-4 and Claude evaluations.

September 2023: ARC announces ARC Evals will spin out as an independent organization.6

December 2023: ARC Evals formally becomes METR, an independent 501(c)(3) nonprofit. ARC returns to being a small theory-focused organization.2

April 2024: Paul Christiano appointed Head of AI Safety at the U.S. AI Safety Institute (NIST); this is a personal government appointment, not an institutional ARC contract.5

2024–present: ARC Theory team (Christiano, Xu, Hilton) continues heuristic arguments research; hiring paused as of January 2024, with plans to reopen in the second half of 2024.3

Policy Impact

ARC's policy influence operates primarily through two channels: the theoretical research program (which has shaped how some researchers and policymakers conceptualize alignment problems) and the evaluation work that was conducted under the ARC Evals name before the METR spin-out. Since December 2023, ongoing evaluation-related policy influence flows through METR rather than ARC.

| Policy Area | Channel of Influence | Evidence | Current Status |
|---|---|---|---|
| Lab Evaluation Practices | ARC Evals methodology (now METR) | GPT-4, Claude evaluations cited in system cards | Ongoing through METR |
| US Government AI Policy | Christiano's personal NIST AISI appointment | April 2024 NIST announcement | Active (Christiano role) |
| UK AISI Collaboration | ARC Evals / METR partnership | UK AISI evaluation methodology draws on METR dataset16 | Ongoing through METR |
| Responsible Scaling Policies | Consultation on evaluation thresholds | Anthropic RSP framework development | Referenced in RSP documentation |
| Academic Research | ELK problem formulation, FPI paper | Cited in alignment literature | Ongoing |

Key Organizational Leaders

Key People
| Person | Title | Start | Founder |
|---|---|---|---|
| Paul Christiano | Founder & Head of Research | Oct 2021 | Yes |

Core Team

Paul Christiano
Founder, Theory Team Lead (also Head of AI Safety, US AISI as of April 2024)
Former OpenAI language model alignment team lead; developed foundational RLHF work; PhD from UC Berkeley; BSc mathematics from MIT
Mark Xu
Permanent Research Scientist, Theory Team
One of two permanent researchers alongside Christiano (the other being Hilton) as of early 2024
Jacob Hilton
Permanent Research Scientist, Theory Team
Co-author on FPI and follow-up heuristic arguments papers

Note on Ajeya Cotra: Earlier versions of this page listed Ajeya Cotra as a Senior Researcher at ARC. Cotra is associated with Coefficient Giving, where she conducted AI timelines research, rather than being a member of ARC's research staff. This has been corrected.

Note on Beth Barnes: Barnes founded and led ARC Evals while it was incubated at ARC. She is now CEO of METR, the independent organization that resulted from the ARC Evals spin-out in December 2023. She is no longer part of ARC.

Paul Christiano's Background and Views

Christiano founded ARC in 2021 after running the language model alignment team at OpenAI, where he conducted foundational work on RLHF. He holds a PhD in computer science from UC Berkeley and a BS in mathematics from MIT.5

In a 2023 post on LessWrong ("My views on doom"), Christiano estimated a 20% probability that most humans die within 10 years of building powerful AI, and a 46% probability that humanity has "irreversibly messed up its future" within that timeframe.17 These estimates were noted in coverage of his April 2024 NIST appointment, with some reports indicating concerns among NIST staff about his EA and longtermism associations.17

In March 2024, Christiano gave a talk at Princeton on "Catastrophic Misalignment of Large Language Models," reviewing evidence on two misalignment pathways and discussing how to assess whether such risks are adequately managed.18

Research Philosophy: ARC's methodology involves attempting to rule out alignment approaches by identifying plausible failure scenarios on paper, without necessarily implementing them. ARC notes this approach may "completely miss strategies that exploit important structure in realistic ML models," but the benefit is the ability to evaluate many ideas quickly.1 This is described as a "builder-breaker" methodology.

Key Uncertainties and Research Cruxes

Fundamental Research Questions

Key Questions

  • Is the ELK problem solvable, or does it represent a fundamental limitation of scalable oversight approaches?
  • How much should researchers update on ARC's heuristic arguments against prosaic alignment approaches?
  • Can evaluations detect sophisticated deception, or can advanced models successfully sandbag against current evaluation methodologies?
  • Is worst-case alignment the appropriate level of caution, or should the field focus on more probable failure modes?
  • Will ARC's theoretical heuristic arguments work lead to actionable safety solutions, or primarily to negative results about what will not work?
  • How can evaluation organizations maintain independence while working closely with the AI labs whose models they evaluate?

Cruxes in the Field

| Disagreement | ARC Position | Alternative View | Evidence Status |
|---|---|---|---|
| Adversarial AI likelihood | AI systems may engage in strategic deception; safety measures should be robust against this | Most misalignment will result from honest mistakes or distribution shift rather than strategic behavior | Insufficient empirical data |
| Evaluation sufficiency | Evaluations are necessary but not sufficient as a governance tool | Pre-deployment evaluations may provide false confidence without addressing underlying alignment | Mixed; METR itself notes "pre-deployment capability testing is not a sufficient risk management strategy by itself"19 |
| Theoretical tractability | Hard theoretical problems are worth pursuing; negative results have value | Field should prioritize near-term practical solutions over potentially intractable theoretical work | Ongoing debate |
| Timeline assumptions | Solutions needed for potentially short timelines to powerful AI | More time available for iterative empirical approaches | Highly uncertain |

Organizational Relationships and Influence

Collaboration Network

| Organization | Relationship Type | Collaboration Areas | Notes |
|---|---|---|---|
| OpenAI | Former evaluator (via ARC Evals) | GPT-4 pre-deployment evaluation (2022–2023) | Now conducted through METR |
| Anthropic | Former evaluator, research adjacency | Claude evaluations, RSP development consultation | Now conducted through METR |
| UK AISI | Indirect (via METR) | Evaluation methodology; UK AISI dataset draws on METR work16 | Partnership with METR, not directly with ARC Theory |
| MIRI | Intellectual adjacency | Shared concern about alignment difficulty; different research strategies | No formal collaboration |
| Redwood Research | Office-sharing | Shared Berkeley office space3 | No formal research collaboration described |
| METR | Spin-out / former division | ARC incubated METR (as ARC Evals) 2022–2023 | Independent since December 2023 |

Policy Impact Timeline

| Date | Development | ARC/METR Role |
|---|---|---|
| 2022 | OpenAI GPT-4 pre-deployment evaluation | ARC Evals conducted evaluation; cited in GPT-4 system card14 |
| 2023 | Anthropic RSP framework | Consultation on evaluation thresholds |
| 2023 | White House AI Executive Order | Evaluation requirements informed by evaluation community including ARC Evals |
| 2023 | UK Foundation Model Taskforce partnership | ARC Evals formally partnered before METR spin-out6 |
| Dec 2023 | METR spin-out completed | ARC Evals becomes independent METR nonprofit2 |
| Apr 2024 | Paul Christiano joins US AISI | Personal appointment as Head of AI Safety at NIST5 |
| 2024 | EU AI Act implementation | Evaluation standards discussions; METR tracks frontier AI safety policies20 |

Critical Analysis and Limitations

Methodological Concerns

| Limitation | Description | Mitigation Efforts | Remaining Uncertainty |
|---|---|---|---|
| Sandbagging Detection | Models may conceal capabilities during evaluation | Capability elicitation techniques developed by METR | Acknowledged as high uncertainty by METR19 |
| Evaluation Gaming | Models might perform well on specific evaluation tasks while retaining other concerning capabilities | Red-team methodology; diverse task suites | Moderate; ongoing area of research |
| Coverage Gaps | Evaluations can only test capabilities evaluators anticipate | Broad capability assessment across multiple domains | Significant; acknowledged in METR communications |
| Independence | Financial and operational relationships with labs being evaluated | Organizational separation (METR now independent); diverse funding | Ongoing structural debate |

Criticism from the Research Community

"Excessive Pessimism" (Yann LeCun and some researchers):

  • Heuristic arguments identify possible failure modes, not inevitable ones
  • Current AI systems display cooperative behavior
  • Worst-case framing may direct resources away from more probable problems

"Insufficient Positive Agendas" (some academic AI safety researchers):

  • ELK and heuristic arguments work has so far produced primarily negative results (counterexamples to proposed solutions)
  • The field may need more constructive research programs alongside problem identification
  • Risk that sophisticated problem-articulation proceeds faster than solution development

ARC's Response (as reflected in published methodology):

  • Negative results prevent false confidence in approaches that will eventually fail
  • Worst-case preparation is appropriate given stakes and uncertainty about timelines
  • The "builder-breaker" methodology is explicitly designed to iterate quickly on theoretical ideas

Note: Whether ARC's responses adequately address these critiques is itself contested, and the debate remains ongoing.

Future Research Directions

Theoretical Research Evolution

Current Focus (as of 2024–2025):

  • Heuristic arguments framework development, building on FPI and follow-up work
  • Formal mechanistic explanations of neural network behavior
  • Applications to mechanistic anomaly detection, safe distillation, and low-probability estimation12

ARC describes the approach as combining mechanistic interpretability and formal verification: "If we had a deep understanding of what was going on inside a neural network... [we could] produce formal mechanistic explanations."1

Potential Directions (Speculative):

  • More tractable subproblems of alignment as heuristic arguments framework matures
  • Empirical testing of theoretical constructs
  • Closer integration with empirical interpretability research at other organizations

Evaluation Methodology Advancement (METR)

These directions apply to METR, ARC's independent spin-out, rather than ARC Theory itself:

| Development Area | Current State | Stated Goals |
|---|---|---|
| General Autonomous Capabilities | ≈50 automatically scored tasks across cybersecurity, software engineering, ML21 | Broader task suite; better capability bounds |
| AI R&D Evaluations | RE-Bench and related benchmarks; human expert baselines collected22 | Identifying "red lines" for AI R&D acceleration |
| Post-deployment Monitoring | Limited | Continuous assessment research |
| International Standards | Advisory role with UK AISI and others | Coordinated evaluation protocols |

Policy Integration

Near-term:

  • METR continues advising AI developers and governments on evaluation methodology
  • Christiano's NIST role connects ARC's intellectual tradition to US government AI safety efforts
  • Twelve major AI companies have published frontier AI safety policies as of late 2025, with METR tracking commonalities20

Medium-term (Speculative):

  • Potential expansion of mandatory independent evaluation requirements
  • International coordination on evaluation standards (Seoul AI Safety Summit commitments: sixteen companies agreed to Frontier AI Safety Commitments in May 202420)

Sources and Resources

Primary Sources

| Source Type | Key Documents |
|---|---|
| Foundational Papers | ELK Report; Formalizing the Presumption of Independence |
| Evaluation Reports | GPT-4 System Card (ARC evaluation cited); ARC Evals 2023 Update |
| Organizational Announcements | ARC Evals Spin-out Announcement; METR Name Announcement |

Footnotes

  1. ARC Official Homepage, Alignment Research Center, accessed 2025. Describes current research focus on combining mechanistic interpretability and formal verification.

  2. ARC Evals is now METR, METR team, December 4, 2023. Formal announcement of name change and independent 501(c)(3) status.

  3. ARC is Hiring Theoretical Researchers, ARC team, March 20, 2024. Describes Theory team composition (Christiano, Xu, Hilton), temporary members, office-sharing with Redwood Research, and hiring pause as of January 2024.

  4. Alignment Research Center — General Support, Coefficient Giving, August 15, 2022. Documents $265,000 grant and notes ARC returned $1.25M FTX grant.

  5. U.S. Commerce Secretary Announces Expansion of US AI Safety Institute Leadership Team, NIST / U.S. Department of Commerce, April 2024. Christiano named Head of AI Safety at US AISI; role involves designing and conducting tests of frontier AI models.

  6. ARC Evals is spinning out from ARC, METR / ARC Evals team, September 19, 2023. Describes ARC Evals' growth, evaluations of GPT-4 and Claude, UK Taskforce partnership, and rationale for spin-out.

  7. A Bird's Eye View of ARC's Research, ARC team, 2023. Describes the two central subproblems: alignment robustness and ELK.

  8. ELK prize results post, ARC team, 2022, as cited in Edmund Mills prize writeup. ARC's assessment: "a problem we don't know how to solve, where we think rapid progress is being made."

  9. ELK Prize Results, ARC team, 2022. Documents 197 proposals received, 32 prizes of $5k–$20k, 24 honorable mentions of $1k, total $274,000.

  10. ELK First Round Contest Winners, ARC team, 2022. Documents first round: 30 proposals from 25 people, $70,000 awarded to 8 people.

  11. ARC paper: Formalizing the Presumption of Independence, AI Alignment Forum, 2023. Describes heuristic arguments as machine-checkable reasoning that doesn't require perfect certainty; applications include mechanistic anomaly detection.

  12. Research update: Towards a Law of Iterated Expectations for Heuristic Estimators, ARC team, October 2024. Introduces "principle of unpredictable errors"; three applications: mechanistic anomaly detection, safe distillation, low probability estimation.

  13. Update on ARC's Recent Eval Efforts, ARC Evals team, March 18, 2023. GPT-4 and Claude not capable of autonomous dangerous activities; succeeded at component tasks including CAPTCHA social engineering.

  14. GPT-4 System Card, OpenAI, March 2023. Describes ARC evaluation of power-seeking behavior; conclusion: "probably not yet capable of autonomously" replicating and acquiring resources. Notes ARC did not have access to the final deployed model version.

  15. More Information About the Dangerous Capability Evaluations We Did With GPT-4 and Claude, ARC Evals team, LessWrong, 2023. Describes evaluation methodology and states future more capable models require careful capability checking.

  16. Advanced AI Evaluations at AISI: May Update, UK AI Safety Institute, May 2024. States that long-horizon tasks in UK AISI evaluations were drawn from METR's dataset; confirms ongoing collaboration on methodology and task design.

  17. My views on "doom", Paul Christiano, LessWrong, 2023. Christiano estimates 20% probability most humans die within 10 years of building powerful AI; 46% probability of irreversibly compromised future.

  18. Catastrophic Misalignment of Large Language Models, Paul Christiano talk at Princeton Language and Intelligence, March 2024.

  19. An Update on METR's Preliminary Evaluations of Claude 3.5 Sonnet and o1, METR team, January 31, 2025. Notes limitations preventing robust capability bounds and states "pre-deployment capability testing is not a sufficient risk management strategy by itself."

  20. Common Elements of Frontier AI Safety Policies, METR team, December 2025. Documents sixteen companies' Seoul AI Safety Commitments (May 2024) and twelve companies with published frontier AI safety policies.

  21. An Update on Our General Capability Evaluations, METR team, August 6, 2024. Describes ~50 automatically scored tasks across cybersecurity, software engineering, and machine learning domains.

  22. METR's Preliminary Evaluation of Claude 3.5 Sonnet, METR team, 2024. Describes evaluation methodology including 38 day-long task attempts by human ML experts as baseline.


Structured Data

Key People

Paul Christiano — Founder & Head of Research · Oct 2021–present

All Facts

Organization

| Property | Value |
|---|---|
| Founded Date | Oct 2021 |
| Headquarters | Berkeley, CA |
| Legal Structure | 501(c)(3) nonprofit |

People

| Property | Value |
|---|---|
| Founded By | Paul Christiano |

General

| Property | Value |
|---|---|
| Website | http://alignment.org/ |


Related Pages

  • Safety Research: AI Control
  • Approaches: Capability Elicitation · AI Alignment
  • Analysis: Deceptive Alignment Decomposition Model · Long-Term Benefit Trust (Anthropic)
  • Policy: Responsible Scaling Policies · EU AI Act
  • Organizations: Apollo Research
  • Concepts: Situational Awareness · EA Epistemic Failures in the FTX Era · FTX Collapse: Lessons for EA Funding Resilience · Large Language Models
  • Risks: Deceptive Alignment
  • Key Debates: AI Accident Risk Cruxes · Why Alignment Might Be Hard · Technical AI Safety Research
  • Other: Sam Bankman-Fried · ARC-AGI · ARC-AGI-2
  • Historical: Mainstream Era