Theoretical Foundations
Theoretical alignment research establishes the conceptual and mathematical foundations for safe AI.
Core Concepts:
- Corrigibility: Systems that allow themselves to be corrected and shut down. Fundamental tensions between goal-directed behavior and shutdown compliance remain unsolved after more than a decade of research.
- Goal Misgeneralization: When learned goals differ from intended goals. Agents can retain their capabilities under distribution shift while pursuing the wrong objective, a failure mode reported in 60-80% of studied RL agents (a toy illustration follows this list).
- Agent Foundations: Mathematical foundations of agency. The core problems of MIRI's research program remain unsolved after 10+ years, prompting MIRI's 2024 strategic pivot away from this agenda.
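The goal misgeneralization failure mode is easiest to see in a toy setting. The sketch below is illustrative only and is not drawn from the linked page: in a one-dimensional gridworld, the intended goal (reach the coin) and a spurious proxy (always move right) coincide during training, so a policy that learned the proxy looks perfectly capable until the coin's position shifts at deployment.

```python
# Toy illustration of goal misgeneralization (hypothetical setup).
# Intended goal: reach the coin. Learned proxy: always move right.
# The two coincide in training, then diverge under distribution shift.

def proxy_policy(agent_pos: int, coin_pos: int) -> int:
    """A policy that learned the spurious proxy 'always move right'."""
    return +1  # ignores coin_pos entirely

def reaches_coin(policy, coin_pos: int, start: int = 5,
                 width: int = 10, max_steps: int = 20) -> bool:
    """Roll out the policy and report whether the agent reaches the coin."""
    pos = start
    for _ in range(max_steps):
        if pos == coin_pos:
            return True
        pos = max(0, min(width - 1, pos + policy(pos, coin_pos)))
    return pos == coin_pos

# Training distribution: the coin always sits at the right edge,
# so the proxy policy looks perfectly aligned.
print(reaches_coin(proxy_policy, coin_pos=9))  # True

# Deployment distribution: the coin moves to the left of the start state.
# The agent is still a competent navigator, but its goal is wrong.
print(reaches_coin(proxy_policy, coin_pos=2))  # False
```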
Scalable Oversight:
- Scalable Oversight: Supervising systems more capable than their overseers. Process supervision reaches 78.2% accuracy on MATH (vs. 72.4% for outcome-based supervision) and is deployed in OpenAI's o1 models; debate shows 60-80% accuracy on factual questions.
- Eliciting Latent Knowledge (ELK): Getting models to report what they know. ARC's prize contest received 197 proposals and awarded $274K, but the $50K and $100K prizes remain unclaimed.
- AI Debate: Using AI to verify AI reasoning by having adversarial systems argue opposing positions for a human judge; recent empirical work reports roughly 88% judge accuracy (a protocol sketch follows this list).
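For concreteness, here is a minimal sketch of the debate protocol's control flow. It assumes generic `query_model` and `judge` callables (hypothetical stand-ins for whatever models are used) and omits the self-play training loop that the full proposal specifies: two debaters defend opposing answers over several rounds, and a weaker judge picks the side that argued better.

```python
from typing import Callable, List, Tuple

def debate(question: str, answer_a: str, answer_b: str,
           query_model: Callable[[str], str],
           judge: Callable[[str], str],
           rounds: int = 3) -> Tuple[str, List[str]]:
    """Run a fixed number of argument exchanges, then ask the judge."""
    transcript: List[str] = [f"Question: {question}",
                             f"Debater A claims: {answer_a}",
                             f"Debater B claims: {answer_b}"]
    for r in range(rounds):
        for name, claim in (("A", answer_a), ("B", answer_b)):
            prompt = (f"You are debater {name}, defending: {claim}\n"
                      + "\n".join(transcript)
                      + "\nGive your strongest short argument and point out "
                        "any flaw in your opponent's last argument.")
            transcript.append(f"Round {r + 1}, debater {name}: {query_model(prompt)}")
    verdict = judge("\n".join(transcript)
                    + "\nWhich answer is better supported, A or B?")
    return verdict, transcript
```

The safety argument is that judging a pointed exchange of arguments is easier than evaluating the original question directly, so weaker overseers can supervise stronger debaters.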
Formal Approaches:
- Formal Verification: Mathematical proofs of safety properties. A roughly 100,000x scale gap separates verified systems (~10k parameters) from frontier models (~1.7T parameters).
- Provably Safe AI: Guarantees through formal methods. Davidad's agenda seeks mathematical safety guarantees via formal verification of world models and values, funded primarily by ARIA's £59M Safeguarded AI programme.
- CIRL (Cooperative Inverse Reinforcement Learning): A theoretical framework in which the AI maintains uncertainty about human preferences, which naturally incentivizes corrigibility and deference; the core results come with formal proofs (a worked numeric example follows this list).
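The CIRL intuition can be made concrete with a small off-switch-game calculation in the style of Hadfield-Menell et al.; the numbers below are made up for illustration. If the robot is uncertain whether a proposed action helps the human (utility U) and a rational human will veto it exactly when U < 0, then deferring earns E[max(U, 0)], which is never less than the max(E[U], 0) available to a robot that acts unilaterally.

```python
# Off-switch-game style calculation (illustrative numbers), assuming a
# rational human who permits the action exactly when its utility U > 0.

# Robot's belief over the action's utility to the human: (utility, probability).
belief = [(-2.0, 0.4), (1.0, 0.6)]

# Act unilaterally (e.g., disable the off-switch): receive U regardless of sign.
value_act = sum(u * p for u, p in belief)              # 0.6*1 + 0.4*(-2) = -0.2

# Defer: propose the action and let the human switch it off whenever U < 0.
value_defer = sum(max(u, 0.0) * p for u, p in belief)  # 0.6*1 = 0.6

print(f"acting unilaterally: {value_act:+.2f}")
print(f"deferring to human:  {value_defer:+.2f}")
# Deference dominates whenever the robot is genuinely unsure of U's sign,
# which is the sense in which preference uncertainty incentivizes corrigibility.
```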
Multi-Agent:
- Cooperative AI: AI systems that cooperate with humans and with each other. The field addresses multi-agent coordination failures through game theory and mechanism design, with roughly $1-20M/year of investment, mostly at DeepMind and academic groups (a toy payoff example follows this list).
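A toy payoff computation (the standard iterated prisoner's dilemma, not a result from the linked page) shows the kind of coordination failure the field studies: two always-defect agents do strictly worse over repeated play than two conditionally cooperative agents.

```python
# Iterated prisoner's dilemma with the standard payoff matrix:
#   both cooperate -> 3 each, both defect -> 1 each,
#   lone defector -> 5, lone cooperator -> 0.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    return "D"

def play(strategy_1, strategy_2, rounds=100):
    h1, h2, score_1, score_2 = [], [], 0, 0
    for _ in range(rounds):
        a1, a2 = strategy_1(h2), strategy_2(h1)
        p1, p2 = PAYOFF[(a1, a2)]
        score_1, score_2 = score_1 + p1, score_2 + p2
        h1.append(a1)
        h2.append(a2)
    return score_1, score_2

print(play(tit_for_tat, tit_for_tat))      # (300, 300): sustained cooperation
print(play(always_defect, always_defect))  # (100, 100): the coordination failure
print(play(tit_for_tat, always_defect))    # (99, 104): exploitation stays bounded
```

Mechanism design then asks how to change the payoffs or institutions so that the cooperative outcome becomes the equilibrium.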