# Paul Christiano

**Person**

Comprehensive biography of Paul Christiano documenting his technical contributions (IDA, debate, scalable oversight), risk assessment (~10-20% P(doom), AGI 2030s-2040s), and evolution from higher optimism to current moderate concern. Documents implementation of his ideas at major labs (RLHF at OpenAI, Constitutional AI at Anthropic) with specific citation to papers and organizational impact.

**Organizations:** Alignment Research Center

**Safety Agendas:** Scalable Oversight

**People:** Eliezer Yudkowsky · Jan Leike
## Overview

Paul Christiano is one of the most influential researchers in AI alignment, known for developing concrete, empirically testable approaches to the alignment problem. With a PhD in theoretical computer science from UC Berkeley, he has worked at OpenAI and DeepMind and founded the Alignment Research Center (ARC).

Christiano pioneered the "prosaic alignment" approach: aligning AI without requiring exotic theoretical breakthroughs. His current risk assessment places roughly 10-20% probability on existential risk from AI this century, with AGI arrival in the 2030s-2040s. His work has directly influenced alignment research programs at major labs including OpenAI, Anthropic, and DeepMind.
## Risk Assessment

| Risk Factor | Christiano's Assessment | Evidence/Reasoning | Comparison to Field |
|---|---|---|---|
| P(doom) | ≈10-20% | Alignment tractable but challenging | Moderate (vs. 50%+ doomers, <5% optimists) |
| AGI Timeline | 2030s-2040s | Gradual capability increase | Mainstream range |
| Alignment Difficulty | Hard but tractable | Iterative progress possible | More optimistic than MIRI |
## Iterated Amplification (IDA)

| Step | Description | Status |
|---|---|---|
| Amplification | Human overseer works with AI assistant on complex tasks | Tested at scale by OpenAI |
| Distillation | Extract human+AI behavior into standalone AI system | Standard ML technique |
| Iteration | Repeat process with increasingly capable systems | Theoretical framework |
| Bootstrapping | Build aligned AGI from aligned weak systems | Core theoretical hope |

**Key insight:** If we can align a weak system and use it to help align slightly stronger systems, we can bootstrap to aligned AGI without solving the full problem directly.
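To make the loop concrete, here is a minimal, illustrative Python sketch of the amplify-distill-iterate structure. Everything in it (the toy "human" synthesis, memoization standing in for distillation, and all function names) is an assumption made for exposition, not Christiano's or OpenAI's implementation.

```python
# Toy sketch of the iterated amplification loop (illustrative only).
from typing import Callable, List

# A "model" here is just a function from a question to an answer.
Model = Callable[[str], str]


def human_with_assistants(question: str, assistants: List[Model]) -> str:
    """Amplification: a simulated human decomposes the task, consults AI
    assistants on sub-questions, and combines their answers."""
    sub_questions = [f"{question} (subtask {i})" for i in range(len(assistants))]
    sub_answers = [assistant(q) for assistant, q in zip(assistants, sub_questions)]
    return " | ".join(sub_answers)  # stand-in for human synthesis


def distill(amplified: Callable[[str], str], training_questions: List[str]) -> Model:
    """Distillation: train a standalone model to imitate the amplified
    human+AI team. Memoizing its answers stands in for actual training."""
    table = {q: amplified(q) for q in training_questions}
    return lambda q: table.get(q, "unknown")


def iterated_amplification(base_model: Model, rounds: int, questions: List[str]) -> Model:
    """Iteration: repeat amplify-then-distill, so each distilled model
    becomes the assistant for the next, more capable round."""
    model = base_model
    for _ in range(rounds):
        amplified = lambda q, m=model: human_with_assistants(q, [m, m])
        model = distill(amplified, questions)
    return model


if __name__ == "__main__":
    weak = lambda q: f"weak answer to '{q}'"
    strong = iterated_amplification(weak, rounds=2, questions=["How should we plan a project?"])
    print(strong("How should we plan a project?"))
```

The structural point is the bootstrapping claim above: each round's distilled model only needs to be aligned enough to serve as a trustworthy assistant in the next round's amplification step.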
## AI Safety via Debate

Co-developed with Geoffrey Irving and Dario Amodei in "AI safety via debate" (2018):

| Mechanism | Implementation | Results |
|---|---|---|
| Adversarial training | Two AIs argue for different positions | Deployed at Anthropic |
| Human judgment | Human evaluates which argument is more convincing | Scales human oversight capability |
| Truth discovery | Debate incentivizes finding flaws in opponent arguments | Mixed empirical results |
| Scalability | Works even when AIs are smarter than humans | Theoretical hope |
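A minimal sketch of the protocol's shape, assuming a simple alternating-turns format and a stand-in judge; the names and toy debaters are hypothetical and are not taken from the 2018 paper's experiments.

```python
# Illustrative two-agent debate round with a human judge (assumed structure).
from typing import Callable, List, Tuple

Debater = Callable[[str, List[str]], str]   # (question, transcript) -> argument
Judge = Callable[[str, List[str]], int]     # (question, transcript) -> winning debater index


def run_debate(question: str, debaters: Tuple[Debater, Debater],
               judge: Judge, rounds: int = 2) -> int:
    """Alternate arguments between two debaters, then ask the judge which
    side was more convincing after seeing the full transcript."""
    transcript: List[str] = []
    for r in range(rounds):
        for i, debater in enumerate(debaters):
            argument = debater(question, transcript)
            transcript.append(f"Round {r}, debater {i}: {argument}")
    return judge(question, transcript)


if __name__ == "__main__":
    pro = lambda q, t: "evidence supporting the claim"
    con = lambda q, t: "a flaw in the previous argument"
    judge = lambda q, t: 0  # stand-in for a human picking the more convincing side
    print("Winner:", run_debate("Is the claim true?", (pro, con), judge))
```

The design bet is that pointing out a flaw is easier than constructing a misleading argument, so the human judge only needs to follow the final exchange rather than verify the whole task directly.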
## Scalable Oversight Framework

Christiano's broader research program on supervising superhuman AI:

| Problem | Proposed Solution | Current Status |
|---|---|---|
| Task too complex for direct evaluation | Process-based feedback vs. outcome evaluation | Implemented at OpenAI |
| AI reasoning opaque to humans | Eliciting Latent Knowledge (ELK) | Active research area |
| Deceptive alignment | Recursive reward modeling | Early-stage research |
| Capability-alignment gap | Assistance games framework | Theoretical foundation |
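To illustrate the first row of the table, a small sketch contrasting outcome-based and process-based scoring of a reasoning chain; the data format and scoring rules are assumptions for exposition, not any lab's actual grading pipeline.

```python
# Toy contrast between outcome-based and process-based feedback (illustrative).
from typing import List


def outcome_score(final_answer: str, correct_answer: str) -> float:
    """Outcome supervision: reward depends only on the final answer."""
    return 1.0 if final_answer == correct_answer else 0.0


def process_score(steps: List[str], step_labels: List[bool]) -> float:
    """Process supervision: a grader (human or model) labels each reasoning
    step as valid or not; reward is the fraction of valid steps."""
    if not steps:
        return 0.0
    return sum(step_labels) / len(steps)


if __name__ == "__main__":
    steps = ["2 + 2 = 4", "4 * 3 = 12", "12 - 5 = 7"]
    labels = [True, True, True]              # grader judgments for each step
    print(outcome_score("7", "7"))           # 1.0: only the answer is checked
    print(process_score(steps, labels))      # 1.0: every step is checked
```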
## Intellectual Evolution and Current Views

### Early Period (2016-2019)

- Higher optimism: alignment seemed more tractable
- IDA focus: believed iterated amplification could solve core problems

His view contrasts with other researchers on whether alignment problems can be caught and corrected in weaker systems:

| Researcher | Position | Reasoning | Confidence |
|---|---|---|---|
| Paul Christiano | Yes; the alignment tax should be acceptable, and we can catch problems in weaker systems | Prosaic alignment through iterative improvement | Medium-high |
| Eliezer Yudkowsky | No; sharp capability jumps mean we won't get useful feedback | Deceptive alignment, treacherous turns, alignment is anti-natural | High |
| Jan Leike | Yes, but we need to move fast as capabilities advance rapidly | Similar to Paul but with more urgency given the current pace | Medium |
### Core Crux Positions

| Issue | Christiano's View | Alternative Views | Implication |
|---|---|---|---|
| Alignment difficulty | Prosaic solutions sufficient | Need fundamental breakthroughs (MIRI) | Different research priorities |
| Takeoff speeds | Gradual, time to iterate | Fast, little warning | Different preparation strategies |
| Coordination feasibility | Moderately optimistic | Pessimistic (racing dynamics) | Different governance approaches |
| Current system alignment | Meaningful progress possible | Current systems too limited | Different research timing |
## Research Influence and Impact

### Direct Implementation

| Technique | Organization | Implementation | Results |
|---|---|---|---|
| RLHF | OpenAI | InstructGPT, ChatGPT | Massive improvement in helpfulness |
| Constitutional AI | Anthropic | Claude training | Reduced harmful outputs |
| Debate methods | DeepMind | Sparrow | Mixed results on truthfulness |
| Process supervision | OpenAI | Math reasoning | Better than outcome supervision |
### Intellectual Leadership

- AI Alignment Forum: primary venue for technical alignment discourse
- Mentorship: trained researchers now at major labs (Jan Leike, Geoffrey Irving, others)
- Problem formulation: the ELK problem is now a central focus across the field
Current Research Agenda (2024)
At ARCOrganizationAlignment Research CenterComprehensive reference page on ARC (Alignment Research Center), covering its evolution from a dual theory/evals organization to ARC Theory (3 permanent researchers) plus the METR spin-out (Decembe...Quality: 57/100, Christiano's priorities include:
Research Area
Specific Focus
Timeline
Power-seeking evaluation
Understanding how AI systems could gain influence gradually
Ongoing
Scalable oversight
Better techniques for supervising superhuman systems
Core program
Alignment evaluation
Metrics for measuring alignment progress
Near-term
Governance research
Coordination mechanisms between labs
Policy-relevant
## Key Uncertainties and Cruxes

Christiano identifies several critical uncertainties:

| Uncertainty | Why It Matters | Current Evidence |
|---|---|---|
| Deceptive alignment prevalence | Determines safety of iterative approach | Mixed signals from current systems |
| Capability jump sizes | Affects whether we get warning | Continuous but accelerating progress |
| Coordination feasibility | Determines governance strategies | Some positive signs |
| Alignment tax magnitude | Economic feasibility of safety | Early evidence suggests low tax |
## Timeline and Trajectory Assessment

### Near-term (2024-2027)

- Continued capability advances in language models
- Better alignment evaluation methods
- Industry coordination on safety standards

### Medium-term (2027-2032)

- Early agentic AI systems
- Critical tests of scalable oversight
- Potential governance frameworks

### Long-term (2032-2040)

- Approach to transformative AI
- Make-or-break period for alignment
- International coordination becomes crucial
## Comparison with Other Researchers

| Researcher | P(doom) | Timeline | Alignment Approach | Coordination View |
|---|---|---|---|---|
| Paul Christiano | ≈15% | 2030s | Prosaic, iterative | Moderately optimistic |
| Eliezer Yudkowsky | ≈90% | 2020s | Fundamental theory | Pessimistic |
| Dario Amodei | ≈10-25% | 2030s | Constitutional AI | Industry-focused |
| Stuart Russell | — | — | Assistance games | — |
## Key Research Areas

- AI safety via debate (2018 paper with Geoffrey Irving and Dario Amodei)
- Scalable oversight: core research focus
- Reward modeling: foundation for many proposals
- AI governance: increasing focus area
- Alignment evaluation
**Education:** PhD in Computer Science, UC Berkeley; BS in Mathematics, MIT

**Notable for:** Pioneer of RLHF and AI alignment research; founder of Alignment Research Center (ARC); key theorist of iterated amplification and eliciting latent knowledge

**Related organizations:** METR · NIST and AI Safety
## Related Pages

**Analysis:** Model Organisms of Misalignment · Capability-Alignment Race Model

**Approaches:** AI Alignment · AI Safety via Debate

**Concepts:** AI Timelines · Existential Risk from AI · Agentic AI · Large Language Models

**Risks:** Deceptive Alignment · AI Development Racing Dynamics

**Key Debates:** AI Alignment Research Agendas · AI Accident Risk Cruxes · The Case For AI Existential Risk · Why Alignment Might Be Hard

**Policy:** Voluntary AI Safety Commitments

**Safety Research:** AI Control

**Historical:** Deep Learning Revolution Era · The MIRI Era