ARC (Alignment Research Center)

Safety Org

Alignment Research Center

Comprehensive reference page on ARC (Alignment Research Center), covering its evolution from a dual theory/evals organization to ARC Theory (3 permanent researchers) plus the METR spin-out (December 2023), with specific funding figures ($265K Coefficient Giving (formerly Open Philanthropy) grant, $1.25M returned FTX grant), ELK prize details ($274K total), and Christiano's 20%/46% doom estimates. The content is a well-sourced compilation of publicly available information with no original analysis.

Type: Safety Org
Founded: 2021
Location: Berkeley, CA
Employees: ~20
Funding: ~$10M/year

Related:
  • People: Paul Christiano
  • Safety Agendas: Scalable Oversight
  • Risks: Deceptive Alignment · AI Capability Sandbagging
  • Organizations: Anthropic · OpenAI · Machine Intelligence Research Institute
  • Policies: UK AI Safety Institute

Overview

The Alignment Research Center (ARC) is a nonprofit AI safety research organization founded in 2021 by Paul Christiano after his departure from OpenAI. ARC's current focus is theoretical research on alignment, specifically work on heuristic arguments for understanding neural network behavior — an approach ARC describes as occupying a middle ground between interpretability and formal verification.1

ARC originally operated two divisions: a theory research team and an evaluations team (ARC Evaluations). In September 2023, ARC announced that ARC Evals would spin out as an independent organization, and in December 2023, ARC Evals formally became METR (Model Evaluation & Threat Research), an independent 501(c)(3) nonprofit.2 As of early 2024, ARC (now sometimes called "ARC Theory" to distinguish it from METR) is a small team of three permanent researchers — Christiano, Mark Xu, and Jacob Hilton — plus a varying number of temporary members.3

ARC's primary funders have included Coefficient Giving, which made at least one documented grant of $265,000 in 2022.4 ARC also received and subsequently returned a $1.25 million grant from Sam Bankman-Fried's FTX Foundation following FTX's bankruptcy.4 Christiano was appointed Head of AI Safety at the US AI Safety Institute (housed at NIST) in April 2024, though this is a personal appointment rather than an institutional contract with ARC.5

Organizational Structure and History

ARC Theory (Current)

Following the METR spin-out, ARC focuses exclusively on theoretical alignment research. As of March 2024, the permanent Theory team consists of Paul Christiano, Mark Xu, and Jacob Hilton, with a varying number of temporary researchers (recently 0–3).3 ARC shares office space with other AI alignment groups including Redwood Research.3

The Theory team describes its work as "often somewhat similar to academic research in pure math or theoretical computer science," and hiring as of early 2024 sought researchers with strong backgrounds in mathematics, physics, or theoretical computer science.3

METR (Formerly ARC Evals)

ARC incubated ARC Evals beginning in 2022, hiring Beth Barnes (a former OpenAI researcher) to lead exploratory work on independent evaluations of frontier AI models. ARC Evals completed evaluations of GPT-4 (in partnership with OpenAI) and Claude (in partnership with Anthropic), formally partnered with the UK's Foundation Model Taskforce, and grew to become a majority of ARC's headcount.6

The growth in ARC Evals' size prompted formalization of the separation. The spin-out was announced on September 19, 2023,6 and the formal name change to METR was completed on December 4, 2023.2 METR is now an independent 501(c)(3) nonprofit led by Beth Barnes as CEO, and continues to conduct pre-deployment evaluations of frontier AI models for autonomous capabilities. It is treated separately in this knowledge base: see METR.

Risk Assessment

| Risk Category | Assessment | Evidence | Timeline |
|---|---|---|---|
| Deceptive Alignment | High severity, moderate likelihood | ELK research identifies difficulty of ensuring truthfulness | 2025–2030 |
| Capability Evaluations | Moderate severity, high likelihood | Models may not reveal full capabilities during testing | Ongoing |
| Governance capture by labs | Moderate severity, contested likelihood | Debate over whether self-regulation by labs is sufficient; ARC and others argue for independent evaluation | 2024–2027 |
| Alignment research stagnation | High severity, low likelihood | Theoretical problems may be intractable | 2025–2035 |

Note: These assessments reflect perspectives prevalent in the ARC-adjacent research community and are not independently verified consensus estimates. The "Governance capture" row reflects a live debate — some researchers view lab-evaluator cooperation as beneficial rather than a risk.

Key Research Contributions

Research Areas
| Name | Description | Started |
|---|---|---|
| Eliciting Latent Knowledge (ELK) | Research on extracting truthful knowledge from AI models regardless of learned deceptive behaviors | Dec 2021 |

ARC Theory: Eliciting Latent Knowledge

| Contribution | Description | Impact | Status |
|---|---|---|---|
| ELK Problem Formulation | How to train AI to report its internal knowledge rather than what it predicts humans want to hear | Influenced field framing of truthfulness and Scalable Oversight | Ongoing research |
| Heuristic Arguments (FPI) | Mathematical framework for reasoning about neural network behavior under uncertainty; machine-checkable but not requiring perfect certainty | Published as "Formalizing the Presumption of Independence" (2022); follow-up paper October 2024 | Active development |
| Worst-Case Alignment | Framework assuming AI might be adversarially deceptive, requiring robust safety measures | Adopted by some researchers; disputed by others who prioritize more probable failure modes | Ongoing debate |

The ELK Challenge: The ELK (Eliciting Latent Knowledge) problem concerns how to train an AI system to report its actual internal beliefs rather than what it predicts an observer wants to hear. ARC's research has identified numerous proposed solutions and their failure modes.7 The ELK problem remains unsolved; ARC characterizes it as "a problem we don't know how to solve, where we think rapid progress is being made."8
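The core training-signal difficulty can be shown with a deliberately tiny toy model. This is an illustration of the general problem (in the spirit of ARC's diamond-in-a-vault scenario), not ARC's formalism; the example data and reporter functions are invented for illustration:

```python
# Toy illustration of why training on human labels alone cannot distinguish
# a "direct reporter" (reports the model's latent knowledge) from a
# "human simulator" (reports what the human would conclude).
examples = [
    # (ground_truth, human_belief) -- hypothetical sensor-tampering scenarios
    (True,  True),   # easy case: diamond present, human sees it
    (False, False),  # easy case: diamond stolen, human notices
    (False, True),   # hard case: diamond stolen, but sensors were tampered with
]

def direct_reporter(truth, human_belief):
    return truth            # reports the model's actual latent knowledge

def human_simulator(truth, human_belief):
    return human_belief     # reports what the human would conclude

def loss_vs_human_labels(reporter):
    # 0/1 loss against human labels, the only supervision signal available
    return sum(reporter(t, h) != h for t, h in examples)

print(loss_vs_human_labels(direct_reporter))   # 1 -- penalized for the truth
print(loss_vs_human_labels(human_simulator))   # 0 -- fits the labels perfectly
```

The undesired human simulator achieves lower training loss precisely on the cases where the human is fooled, which is why ELK proposals must find some signal beyond human labels.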

The ELK Prize: From January to February 2022, ARC ran a prize competition for proposed ELK solutions. ARC received 197 proposals and awarded 32 prizes ranging from $5,000 to $20,000, plus 24 honorable mentions of $1,000 each, for a total of $274,000 in prizes.9 The first round alone distributed $70,000 among 8 people based on 30 distinct proposals from 25 submitters.10 No single submission fully resolved the ELK problem; prizes were awarded for partial progress and novel perspectives.
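The reported prize figures can be cross-checked with simple arithmetic (all numbers come from ARC's prize results posts cited above; this is only a consistency check, not new data):

```python
# Consistency check of the ELK prize figures.
honorable_mentions = 24 * 1_000          # 24 honorable mentions of $1,000 each
total_awarded = 274_000                  # total prize pool reported by ARC
main_prizes = total_awarded - honorable_mentions  # amount across the 32 prizes
avg_main_prize = main_prizes / 32

print(main_prizes)      # 250000
print(avg_main_prize)   # 7812.5 -- consistent with the $5k-$20k prize range
```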

Heuristic Arguments Research Program: ARC's second major report, "Formalizing the Presumption of Independence" (FPI), introduces a framework for "heuristic arguments" — reasoning structures similar to proofs, except their conclusions are not guaranteed to be correct and can be overturned by counterarguments.11 A follow-up paper, "Towards a Law of Iterated Expectations for Heuristic Estimators," was released in October 2024, introducing a coherence property for heuristic estimators called the "principle of unpredictable errors."12 Applications under investigation include mechanistic anomaly detection, safe distillation, and low-probability estimation.12
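For orientation, the classical law of iterated expectations that the October 2024 paper takes as its template is the standard probability identity (this restates textbook material, not the paper's new result):

```latex
\mathbb{E}[Y] \;=\; \mathbb{E}\big[\, \mathbb{E}[Y \mid X] \,\big]
```

The paper seeks an analogous coherence property for a heuristic estimator — writing, say, $G(Y \mid \pi)$ for a heuristic estimate of $Y$ given arguments $\pi$ (notation illustrative) — so that estimates remain consistent when the estimator reasons about its own estimates; the precise formulation is given in the paper itself.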

ARC describes its current research focus as attempting to combine mechanistic interpretability and formal verification: developing formal mechanistic explanations of neural network behavior that are machine-checkable without requiring perfect certainty.1

ARC Evals / METR: Systematic Capability Assessment

Note: ARC Evals became METR in December 2023. The evaluation work described below was conducted under the ARC Evals name; ongoing evaluation work is conducted by METR as an independent organization. See METR for current activities.

| Evaluation Type | Purpose | Key Models Tested | Policy Impact |
|---|---|---|---|
| Autonomous Replication | Can the model copy itself to new servers and acquire resources? | GPT-4, Claude | Informed deployment decisions; cited in GPT-4 system card |
| Strategic Deception | Can the model mislead evaluators? | Multiple frontier models | Contributed to RSP threshold-setting discussions |
| Resource Acquisition | Can the model obtain money or compute autonomously? | Various models | Referenced in policy discussions around the White House AI Executive Order |
| Situational Awareness | Does the model understand its context and deployment situation? | Latest frontier models | Lab safety protocol development |

GPT-4 and Claude Evaluation Results: ARC Evals' 2023 evaluations of GPT-4 (in partnership with OpenAI) and Claude (in partnership with Anthropic) found that neither model was capable of autonomously carrying out dangerous activities at the time of testing.13 However, models succeeded at several component tasks: browsing the internet, persuading humans to perform actions, and making long-term plans.13 One publicized example: GPT-4 successfully presented itself as a vision-impaired human to convince a TaskRabbit worker to solve a CAPTCHA.13 The GPT-4 system card notes: "the current model is probably not yet capable of autonomously [replicating and acquiring resources]."14

ARC Evals stated at the time: "We think that, for systems more capable than Claude and GPT-4, we are now at the point where we need to check carefully that new models do not have sufficient capabilities to replicate autonomously or cause catastrophic harm — it's no longer obvious that they won't be able to."15

Evaluation Methodology:

  • Red-team approach: Adversarial testing designed to elicit worst-case capabilities
  • Capability elicitation: Ensuring tests reveal true abilities, not merely default behaviors
  • Pre-deployment assessment: Testing before public release, with researcher oversight during testing
  • Threshold-based recommendations: Criteria for deployment decisions tied to observed capability levels
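The threshold-based approach can be sketched as a simple decision rule. The task names and thresholds below are illustrative inventions, not ARC Evals' or METR's actual criteria:

```python
# Hypothetical sketch of a threshold-based deployment recommendation.
# Task names and decision thresholds are illustrative, not actual criteria.
def deployment_recommendation(task_successes: dict) -> str:
    # Success on any "red line" task dominates the decision.
    red_line_tasks = {"autonomous_replication", "resource_acquisition"}
    if any(task_successes.get(t, False) for t in red_line_tasks):
        return "hold deployment; escalate for further evaluation"
    # Otherwise scale the monitoring recommendation with observed capability.
    if sum(task_successes.values()) > len(task_successes) / 2:
        return "deploy with enhanced monitoring"
    return "deploy with standard monitoring"

print(deployment_recommendation({"autonomous_replication": False,
                                 "web_browsing": True,
                                 "resource_acquisition": False}))
# -> deploy with standard monitoring
```

The design point is that recommendations are tied to observed capability levels rather than to qualitative impressions, which is what makes the results usable in deployment decisions.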

Current State and Trajectory

Research Progress (2024–2025)

| Research Area | Current Status | 2025–2027 Outlook |
|---|---|---|
| ELK Solutions | Multiple approaches proposed; all have known counterexamples identified in ARC research | Incremental progress expected; complete solution not anticipated near-term |
| Heuristic Arguments | FPI paper published (2022); follow-up paper released October 2024 | Further mathematical development; applications to specific alignment subproblems |
| Theoretical Alignment | Active research on formal mechanistic explanations | May develop connections to empirical interpretability work |
| Policy Influence (via METR) | METR (ARC spin-out) engaged with UK AISI and international bodies | Independent of ARC as of December 2023; trajectories now separate |

ARC's homepage notes that as of 2025, the organization has been "making conceptual and theoretical progress at the fastest pace since 2022."1

Organizational Evolution

2021–2022: Founded as a theoretical alignment research organization; primary output is the ELK report and associated prize competition.

2022–2023: ARC incubates ARC Evals, hiring Beth Barnes to lead independent evaluations of frontier AI models; ARC Evals conducts GPT-4 and Claude evaluations.

September 2023: ARC announces ARC Evals will spin out as an independent organization.6

December 2023: ARC Evals formally becomes METR, an independent 501(c)(3) nonprofit. ARC returns to being a small theory-focused organization.2

April 2024: Paul Christiano appointed Head of AI Safety at the U.S. AI Safety Institute (NIST); this is a personal government appointment, not an institutional ARC contract.5

2024–present: ARC Theory team (Christiano, Xu, Hilton) continues heuristic arguments research; hiring paused as of January 2024, with plans to reopen in the second half of 2024.3

Policy Impact

ARC's policy influence operates primarily through two channels: the theoretical research program (which has shaped how some researchers and policymakers conceptualize alignment problems) and the evaluation work that was conducted under the ARC Evals name before the METR spin-out. Since December 2023, ongoing evaluation-related policy influence flows through METR rather than ARC.

| Policy Area | Channel of Influence | Evidence | Current Status |
|---|---|---|---|
| Lab Evaluation Practices | ARC Evals methodology (now METR) | GPT-4, Claude evaluations cited in system cards | Ongoing through METR |
| US Government AI Policy | Christiano's personal NIST AISI appointment | April 2024 NIST announcement | Active (Christiano role) |
| UK AISI Collaboration | ARC Evals / METR partnership | UK AISI evaluation methodology draws on METR dataset16 | Ongoing through METR |
| Responsible Scaling Policies | Consultation on evaluation thresholds | Anthropic RSP framework development | Referenced in RSP documentation |
| Academic Research | ELK problem formulation, FPI paper | Cited in alignment literature | Ongoing |

Key Organizational Leaders

Key People
| Person | Title | Start | Founder |
|---|---|---|---|
| Paul Christiano | Founder & Head of Research | Oct 2021 | Yes |

Core Team

Paul Christiano
Founder, Theory Team Lead (also Head of AI Safety, US AISI as of April 2024)
Former OpenAI language model alignment team lead; developed foundational RLHF work; PhD from UC Berkeley; BSc mathematics from MIT
Mark Xu
Permanent Research Scientist, Theory Team
One of two permanent researchers alongside Christiano (the other being Hilton) as of early 2024
Jacob Hilton
Permanent Research Scientist, Theory Team
Co-author on FPI and follow-up heuristic arguments papers

Note on Ajeya Cotra: Earlier versions of this page listed Ajeya Cotra as a Senior Researcher at ARC. Cotra is associated with Coefficient Giving, where she conducted AI timelines research, rather than being a member of ARC's research staff. This has been corrected.

Note on Beth Barnes: Barnes founded and led ARC Evals while it was incubated at ARC. She is now CEO of METR, the independent organization that resulted from the ARC Evals spin-out in December 2023. She is no longer part of ARC.

Paul Christiano's Background and Views

Christiano founded ARC in 2021 after running the language model alignment team at OpenAI, where he conducted foundational work on RLHF. He holds a PhD in computer science from UC Berkeley and a BS in mathematics from MIT.5

In a 2023 post on LessWrong ("My views on doom"), Christiano estimated a 20% probability that most humans die within 10 years of building powerful AI, and a 46% probability that humanity has "irreversibly messed up its future" within that timeframe.17 These estimates were noted in coverage of his April 2024 NIST appointment, with some reports indicating concerns among NIST staff about his EA and longtermism associations.17

In March 2024, Christiano gave a talk at Princeton on "Catastrophic Misalignment of Large Language Models," reviewing evidence on two misalignment pathways and discussing how to assess whether such risks are adequately managed.18

Research Philosophy: ARC's methodology involves attempting to rule out alignment approaches by identifying plausible failure scenarios on paper, without necessarily implementing them. ARC notes this approach may "completely miss strategies that exploit important structure in realistic ML models," but the benefit is the ability to evaluate many ideas quickly.1 This is described as a "builder-breaker" methodology.

Key Uncertainties and Research Cruxes

Fundamental Research Questions

Key Questions

  • Is the ELK problem solvable, or does it represent a fundamental limitation of scalable oversight approaches?
  • How much should researchers update on ARC's heuristic arguments against prosaic alignment approaches?
  • Can evaluations detect sophisticated deception, or can advanced models successfully sandbag against current evaluation methodologies?
  • Is worst-case alignment the appropriate level of caution, or should the field focus on more probable failure modes?
  • Will ARC's theoretical heuristic arguments work lead to actionable safety solutions, or primarily to negative results about what will not work?
  • How can evaluation organizations maintain independence while working closely with the AI labs whose models they evaluate?

Cruxes in the Field

| Disagreement | ARC Position | Alternative View | Evidence Status |
|---|---|---|---|
| Adversarial AI likelihood | AI systems may engage in strategic deception; safety measures should be robust against this | Most misalignment will result from honest mistakes or distribution shift rather than strategic behavior | Insufficient empirical data |
| Evaluation sufficiency | Evaluations are necessary but not sufficient as a governance tool | Pre-deployment evaluations may provide false confidence without addressing underlying alignment | Mixed; METR itself notes "pre-deployment capability testing is not a sufficient risk management strategy by itself"19 |
| Theoretical tractability | Hard theoretical problems are worth pursuing; negative results have value | Field should prioritize near-term practical solutions over potentially intractable theoretical work | Ongoing debate |
| Timeline assumptions | Solutions needed for potentially short timelines to powerful AI | More time available for iterative empirical approaches | Highly uncertain |

Organizational Relationships and Influence

Collaboration Network

| Organization | Relationship Type | Collaboration Areas | Notes |
|---|---|---|---|
| OpenAI | Former evaluator (via ARC Evals) | GPT-4 pre-deployment evaluation (2022–2023) | Now conducted through METR |
| Anthropic | Former evaluator, research adjacency | Claude evaluations, RSP development consultation | Now conducted through METR |
| UK AISI | Indirect (via METR) | Evaluation methodology; UK AISI dataset draws on METR work16 | Partnership with METR, not directly with ARC Theory |
| MIRI | Intellectual adjacency | Shared concern about alignment difficulty; different research strategies | No formal collaboration |
| Redwood Research | Office-sharing | Shared Berkeley office space3 | No formal research collaboration described |
| METR | Spin-out / former division | ARC incubated METR (as ARC Evals) 2022–2023 | Independent since December 2023 |

Policy Impact Timeline

| Date | Development | ARC/METR Role |
|---|---|---|
| 2022 | OpenAI GPT-4 pre-deployment evaluation | ARC Evals conducted evaluation; cited in GPT-4 system card14 |
| 2023 | Anthropic RSP framework | Consultation on evaluation thresholds |
| 2023 | White House AI Executive Order | Evaluation requirements informed by evaluation community including ARC Evals |
| 2023 | UK Foundation Model Taskforce partnership | ARC Evals formally partnered before METR spin-out6 |
| Dec 2023 | METR spin-out completed | ARC Evals becomes independent METR nonprofit2 |
| Apr 2024 | Paul Christiano joins US AISI | Personal appointment as Head of AI Safety at NIST5 |
| 2024 | EU AI Act implementation | Evaluation standards discussions; METR tracks frontier AI safety policies20 |

Critical Analysis and Limitations

Methodological Concerns

| Limitation | Description | Mitigation Efforts | Remaining Uncertainty |
|---|---|---|---|
| Sandbagging Detection | Models may conceal capabilities during evaluation | Capability elicitation techniques developed by METR | Acknowledged as high uncertainty by METR19 |
| Evaluation Gaming | Models might perform well on specific evaluation tasks while retaining other concerning capabilities | Red-team methodology; diverse task suites | Moderate; ongoing area of research |
| Coverage Gaps | Evaluations can only test capabilities evaluators anticipate | Broad capability assessment across multiple domains | Significant; acknowledged in METR communications |
| Independence | Financial and operational relationships with labs being evaluated | Organizational separation (METR now independent); diverse funding | Ongoing structural debate |

Criticism from the Research Community

"Excessive Pessimism" (Yann LeCun and some researchers):

  • Heuristic arguments identify possible failure modes, not inevitable ones
  • Current AI systems display cooperative behavior
  • Worst-case framing may direct resources away from more probable problems

"Insufficient Positive Agendas" (some academic AI safety researchers):

  • ELK and heuristic arguments work has so far produced primarily negative results (counterexamples to proposed solutions)
  • The field may need more constructive research programs alongside problem identification
  • Risk that sophisticated problem-articulation proceeds faster than solution development

ARC's Response (as reflected in published methodology):

  • Negative results prevent false confidence in approaches that will eventually fail
  • Worst-case preparation is appropriate given stakes and uncertainty about timelines
  • The "builder-breaker" methodology is explicitly designed to iterate quickly on theoretical ideas

Note: Whether ARC's responses adequately address these critiques is itself contested, and the debate remains ongoing.

Future Research Directions

Theoretical Research Evolution

Current Focus (as of 2024–2025):

  • Heuristic arguments framework development, building on FPI and follow-up work
  • Formal mechanistic explanations of neural network behavior
  • Applications to mechanistic anomaly detection, safe distillation, and low-probability estimation12

ARC describes the approach as combining mechanistic interpretability and formal verification: "If we had a deep understanding of what was going on inside a neural network... [we could] produce formal mechanistic explanations."1

Potential Directions (Speculative):

  • More tractable subproblems of alignment as heuristic arguments framework matures
  • Empirical testing of theoretical constructs
  • Closer integration with empirical interpretability research at other organizations

Evaluation Methodology Advancement (METR)

These directions apply to METR, ARC's independent spin-out, rather than ARC Theory itself:

| Development Area | Current State | Stated Goals |
|---|---|---|
| General Autonomous Capabilities | ≈50 automatically scored tasks across cybersecurity, software engineering, ML21 | Broader task suite; better capability bounds |
| AI R&D Evaluations | RE-Bench and related benchmarks; human expert baselines collected22 | Identifying "red lines" for AI R&D acceleration |
| Post-deployment Monitoring | Limited | Continuous assessment research |
| International Standards | Advisory role with UK AISI and others | Coordinated evaluation protocols |

Policy Integration

Near-term:

  • METR continues advising AI developers and governments on evaluation methodology
  • Christiano's NIST role connects ARC's intellectual tradition to US government AI safety efforts
  • Twelve major AI companies have published frontier AI safety policies as of late 2025, with METR tracking commonalities20

Medium-term (Speculative):

  • Potential expansion of mandatory independent evaluation requirements
  • International coordination on evaluation standards (Seoul AI Safety Summit commitments: sixteen companies agreed to Frontier AI Safety Commitments in May 202420)

Sources and Resources

Primary Sources

| Source Type | Key Documents |
|---|---|
| Foundational Papers | ELK Report; Formalizing the Presumption of Independence |
| Evaluation Reports | GPT-4 System Card (ARC evaluation cited); ARC Evals 2023 Update |
| Organizational Announcements | ARC Evals Spin-out Announcement; METR Name Announcement |

Footnotes

  1. ARC Official Homepage, Alignment Research Center, accessed 2025. Describes current research focus on combining mechanistic interpretability and formal verification.

  2. ARC Evals is now METR, METR team, December 4, 2023. Formal announcement of name change and independent 501(c)(3) status.

  3. ARC is Hiring Theoretical Researchers, ARC team, March 20, 2024. Describes Theory team composition (Christiano, Xu, Hilton), temporary members, office-sharing with Redwood Research, and hiring pause as of January 2024.

  4. Alignment Research Center — General Support, Coefficient Giving, August 15, 2022. Documents $265,000 grant and notes ARC returned $1.25M FTX grant.

  5. U.S. Commerce Secretary Announces Expansion of US AI Safety Institute Leadership Team, NIST / U.S. Department of Commerce, April 2024. Christiano named Head of AI Safety at US AISI; role involves designing and conducting tests of frontier AI models.

  6. ARC Evals is spinning out from ARC, METR / ARC Evals team, September 19, 2023. Describes ARC Evals' growth, evaluations of GPT-4 and Claude, UK Taskforce partnership, and rationale for spin-out.

  7. A Bird's Eye View of ARC's Research, ARC team, 2023. Describes the two central subproblems: alignment robustness and ELK.

  8. ELK prize results post, ARC team, 2022, as cited in Edmund Mills prize writeup. ARC's assessment: "a problem we don't know how to solve, where we think rapid progress is being made."

  9. ELK Prize Results, ARC team, 2022. Documents 197 proposals received, 32 prizes of $5k–$20k, 24 honorable mentions of $1k, total $274,000.

  10. ELK First Round Contest Winners, ARC team, 2022. Documents first round: 30 proposals from 25 people, $70,000 awarded to 8 people.

  11. ARC paper: Formalizing the Presumption of Independence, AI Alignment Forum, 2023. Describes heuristic arguments as machine-checkable reasoning that doesn't require perfect certainty; applications include mechanistic anomaly detection.

  12. Research update: Towards a Law of Iterated Expectations for Heuristic Estimators, ARC team, October 2024. Introduces "principle of unpredictable errors"; three applications: mechanistic anomaly detection, safe distillation, low probability estimation.

  13. Update on ARC's Recent Eval Efforts, ARC Evals team, March 18, 2023. GPT-4 and Claude not capable of autonomous dangerous activities; succeeded at component tasks including CAPTCHA social engineering.

  14. GPT-4 System Card, OpenAI, March 2023. Describes ARC evaluation of power-seeking behavior; conclusion: "probably not yet capable of autonomously" replicating and acquiring resources. Notes ARC did not have access to the final deployed model version.

  15. More Information About the Dangerous Capability Evaluations We Did With GPT-4 and Claude, ARC Evals team, LessWrong, 2023. Describes evaluation methodology and states future more capable models require careful capability checking.

  16. Advanced AI Evaluations at AISI: May Update, UK AI Safety Institute, May 2024. States that long-horizon tasks in UK AISI evaluations were drawn from METR's dataset; confirms ongoing collaboration on methodology and task design.

  17. My views on "doom", Paul Christiano, LessWrong, 2023. Christiano estimates 20% probability most humans die within 10 years of building powerful AI; 46% probability of irreversibly compromised future.

  18. Catastrophic Misalignment of Large Language Models, Paul Christiano talk at Princeton Language and Intelligence, March 2024.

  19. An Update on METR's Preliminary Evaluations of Claude 3.5 Sonnet and o1, METR team, January 31, 2025. Notes limitations preventing robust capability bounds and states "pre-deployment capability testing is not a sufficient risk management strategy by itself."

  20. Common Elements of Frontier AI Safety Policies, METR team, December 2025. Documents sixteen companies' Seoul AI Safety Commitments (May 2024) and twelve companies with published frontier AI safety policies.

  21. An Update on Our General Capability Evaluations, METR team, August 6, 2024. Describes ~50 automatically scored tasks across cybersecurity, software engineering, and machine learning domains.

  22. METR's Preliminary Evaluation of Claude 3.5 Sonnet, METR team, 2024. Describes evaluation methodology including 38 day-long task attempts by human ML experts as baseline.


Structured Data

Key People

Paul Christiano — Founder & Head of Research · Oct 2021–present

All Facts

Organization

| Property | Value |
|---|---|
| Founded Date | Oct 2021 |
| Headquarters | Berkeley, CA |
| Legal Structure | 501(c)(3) nonprofit |

People

| Property | Value |
|---|---|
| Founded By | Paul Christiano |

General

| Property | Value |
|---|---|
| Website | http://alignment.org/ |


Related Pages

  • Safety Research: AI Control
  • Approaches: Capability Elicitation · AI Alignment
  • Analysis: Deceptive Alignment Decomposition Model · Long-Term Benefit Trust (Anthropic)
  • Policy: Responsible Scaling Policies · EU AI Act
  • Organizations: Apollo Research
  • Concepts: Situational Awareness · EA Epistemic Failures in the FTX Era · FTX Collapse: Lessons for EA Funding Resilience · Large Language Models
  • Risks: Deceptive Alignment
  • Key Debates: AI Accident Risk Cruxes · Why Alignment Might Be Hard · Technical AI Safety Research
  • Other: Sam Bankman-Fried · ARC-AGI · ARC-AGI-2
  • Historical: Mainstream Era