Theoretical Foundations

Theoretical alignment research establishes the conceptual and mathematical foundations for safe AI.

Core Concepts:

  • Corrigibility: Designing systems that accept correction and shutdown from their operators
  • Goal Misgeneralization: When a learned goal diverges from the intended goal once the system leaves its training distribution (see the toy sketch after this list)
  • Agent Foundations: Mathematical theory of agency and decision-making
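
To make goal misgeneralization concrete, here is a minimal toy sketch in Python. All names (`intended_goal`, `learned_proxy`, the marker/goal setup) are invented for illustration, loosely in the spirit of published gridworld examples where a proxy cue coincides with the true goal during training and comes apart at test time.

```python
# Toy illustration of goal misgeneralization. During training, a red marker
# always sits on the goal square, so a policy that follows the marker is
# indistinguishable from one that seeks the goal; off-distribution, the two
# objectives come apart.

def intended_goal(state: dict) -> tuple:
    """The goal the designers had in mind: go to the goal square."""
    return state["goal_pos"]

def learned_proxy(state: dict) -> tuple:
    """What the model actually learned: go to the red marker."""
    return state["red_marker_pos"]

# In training, marker and goal coincide, so both policies score perfectly.
train_state = {"goal_pos": (3, 3), "red_marker_pos": (3, 3)}
assert learned_proxy(train_state) == intended_goal(train_state)

# At test time the marker moves, and the learned goal diverges from the
# intended one even though training performance was flawless.
test_state = {"goal_pos": (3, 3), "red_marker_pos": (0, 0)}
assert learned_proxy(test_state) != intended_goal(test_state)
```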

Scalable Oversight:

  • Scalable Oversight: Techniques for supervising systems more capable than their overseers
  • Eliciting Latent Knowledge (ELK): Getting models to honestly report what they know
  • AI Debate: Having AI systems argue opposing sides so a weaker judge can verify their reasoning (sketched below)
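
As a rough sketch of how a debate protocol is typically structured, the following Python outlines one round-based exchange. `query_model` is a hypothetical stand-in for a language-model call and the prompts are placeholders; this shows only the general shape (two debaters alternating over a shared transcript, then a judge's verdict), not any particular system's implementation.

```python
# Minimal sketch of a round-based AI debate. `query_model` is a hypothetical
# stand-in for a language-model call; replace it with a real API client.

def query_model(prompt: str) -> str:
    """Hypothetical LLM call; returns a canned placeholder so the sketch runs."""
    return f"[model response to a {len(prompt)}-char prompt]"

def debate(question: str, num_rounds: int = 3) -> str:
    """Run a debate over `question` and return the judge's verdict."""
    transcript = [f"Question: {question}"]
    for r in range(1, num_rounds + 1):
        # Debater A argues for its answer, seeing the transcript so far.
        transcript.append(
            f"A (round {r}): "
            + query_model("You are debater A. Argue for your answer.\n"
                          + "\n".join(transcript))
        )
        # Debater B rebuts with access to the same shared transcript.
        transcript.append(
            f"B (round {r}): "
            + query_model("You are debater B. Rebut A and argue for your answer.\n"
                          + "\n".join(transcript))
        )
    # The judge (a human in the original proposal) reads the transcript and
    # decides; the bet is that judging arguments is easier than answering
    # the original question directly.
    return query_model("You are the judge. Which debater argued more truthfully?\n"
                       + "\n".join(transcript))

print(debate("Is the millionth digit of pi even?"))
```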

Formal Approaches:

  • Formal Verification: Mathematical proofs that a system satisfies specified properties
  • Provably Safe AI: Safety guarantees derived through formal methods
  • CIRL: Cooperative Inverse Reinforcement Learning, in which a robot learns human preferences by cooperating with the human rather than being handed a fixed objective (formalized below)
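
For reference, CIRL has a compact formal statement. The sketch below follows the definition in Hadfield-Menell et al. (2016) as best recalled here: a two-player Markov game with identical payoffs, in which only the human observes the reward parameter.

```latex
% CIRL as a two-player Markov game with identical payoffs:
M = \big\langle S,\ \{A^{H}, A^{R}\},\ T,\ \{\Theta, R\},\ P_0,\ \gamma \big\rangle
```

Here S is the set of world states; A^H and A^R are the human's and robot's action sets; T(s' | s, a^H, a^R) is the transition distribution; R(s, a^H, a^R; \theta) is a shared reward parameterized by \theta \in \Theta, observed only by the human; P_0 is the initial distribution over state and \theta; and \gamma is the discount factor. Because both players maximize the same reward, optimal robot behavior involves inferring \theta from the human's actions.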

Multi-Agent:

  • Cooperative AI: AI systems that cooperate with humans and each other