Theoretical Foundations (Overview)
Theoretical alignment research establishes the conceptual and mathematical foundations for safe AI.
Core Concepts:
- Corrigibility: Designing systems that accept correction and shutdown rather than resisting them
- Goal Misgeneralization: When a model competently pursues a learned goal that diverges from the intended goal, typically under distribution shift
- Agent Foundations: Mathematical theories of agency, decision-making, and embedded reasoning
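Corrigibility can be illustrated with a small expected-value calculation, loosely in the spirit of the "off-switch game" analysis: all numbers and function names below are assumed for illustration, not taken from the source.

```python
# Toy "off-switch" calculation: a robot uncertain about an action's value
# compares acting immediately, shutting down, and deferring to a human
# who vetoes harmful actions. All numbers here are illustrative.

def expected_value(robot_choice, true_utility):
    if robot_choice == "act":
        return true_utility          # commit regardless of consequences
    if robot_choice == "shutdown":
        return 0.0                   # safe but useless
    # "defer": the human sees the true utility and approves only if positive
    return max(true_utility, 0.0)

# The robot believes the action is worth +1 or -2 with equal probability.
outcomes = [1.0, -2.0]
for choice in ("act", "shutdown", "defer"):
    ev = sum(expected_value(choice, u) for u in outcomes) / len(outcomes)
    print(f"{choice}: {ev:+.1f}")
# defer (+0.5) beats shutdown (0.0) beats act (-0.5): under uncertainty,
# keeping the human in the loop is the robot's own best option.
```

The point of the toy model: the deferential policy wins precisely because the robot is uncertain about the action's true value, which is the core intuition behind corrigibility arguments.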
Scalable Oversight:
- Scalable Oversight: Techniques for reliably supervising systems more capable than their human overseers
- Eliciting Latent Knowledge (ELK): Getting models to report what they internally know, even when that differs from what an evaluator would reward
- AI Debate: Having AI systems argue opposing positions so that a weaker judge can evaluate claims it could not verify directly
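The debate idea can be sketched in a few lines. This is a hypothetical toy, not a real library API: the debaters, judge, and transcript format below are all illustrative stand-ins.

```python
# Toy sketch of the debate protocol: two debaters argue opposite answers,
# and a weak judge picks a side based only on the transcript.

def debate(question, debater_a, debater_b, judge, rounds=2):
    """Alternate arguments for a fixed number of rounds, then judge."""
    transcript = [("question", question)]
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, transcript)))
        transcript.append(("B", debater_b(question, transcript)))
    return judge(transcript)  # verdict: "A" or "B"

# Stub debaters: A backs its claim with a checkable step, B does not,
# and the weak judge simply favors whichever side offers a check.
debater_a = lambda q, t: "claim: 17 is prime; check: no divisor in 2..4"
debater_b = lambda q, t: "claim: 17 is composite"
judge = lambda t: "A" if any("check:" in msg for _, msg in t[1:]) else "B"

print(debate("Is 17 prime?", debater_a, debater_b, judge))  # prints "A"
```

The hoped-for dynamic is that arguing for the truth is easier than arguing against it, so honest debaters win even before a judge who cannot verify the question alone.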
Formal Approaches:
- Formal Verification: Machine-checked mathematical proofs that a system satisfies specified properties
- Provably Safe AI: Safety guarantees established through formal methods rather than empirical testing alone
- CIRL (Cooperative Inverse Reinforcement Learning): Framing alignment as a two-player game in which the AI is uncertain about the human's reward function and must infer it from the human's behavior
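CIRL is usually formalized (following Hadfield-Menell et al.) as a two-player Markov game with a shared reward but asymmetric information; the tuple below states that standard form, with notation assumed from that line of work.

```latex
% CIRL as a two-player game: both players maximize the same reward R,
% but only the human observes the reward parameter \theta \sim P_0.
\mathcal{M} = \big\langle S,\; \{A_H, A_R\},\; T(s' \mid s, a_H, a_R),\;
    \Theta,\; R(s, a_H, a_R; \theta),\; P_0(\theta),\; \gamma \big\rangle
```

Because the robot shares the human's reward but not the human's knowledge of \(\theta\), its optimal policy involves learning from human actions rather than acting on a fixed objective.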
Multi-Agent:
- Cooperative AI: Designing AI systems that cooperate effectively with humans and with each other
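A classic minimal illustration of cooperative dynamics is the iterated prisoner's dilemma, where reciprocating agents sustain mutual cooperation. The payoffs and agent code below are illustrative, not taken from the source.

```python
# Toy iterated prisoner's dilemma: two tit-for-tat agents sustain
# mutual cooperation across repeated rounds.

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    # Cooperate first, then mirror the opponent's previous move.
    return "C" if not opponent_history else opponent_history[-1]

def play(rounds=10):
    hist_a, hist_b = [], []          # moves each player has made so far
    score_a = score_b = 0
    for _ in range(rounds):
        a = tit_for_tat(hist_b)      # A reacts to B's past moves
        b = tit_for_tat(hist_a)      # B reacts to A's past moves
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play())  # prints (30, 30): cooperation every round
```

Neither agent ever defects, so both collect the mutual-cooperation payoff every round; research in cooperative AI studies how to get such outcomes in far richer settings.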