
AI Safety via Debate

AI Safety via Debate proposes pitting adversarial AI systems against each other to argue opposing sides of a question while humans judge the exchange, with the aim of scaling alignment oversight to superhuman capabilities. While theoretically promising, and specifically designed to address RLHF's scalability limitations, the approach remains experimental with limited empirical validation.
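
The round structure of the protocol is simple to state: two debaters alternate arguments on a fixed question, and a judge who sees only the transcript picks a winner. Below is a minimal sketch of that structure in Python; the `Debater` and `Judge` callables are hypothetical stand-ins for illustration, not any real API.

```python
"""Minimal sketch of the debate protocol: two AI debaters take
opposing stances on a question, and a (human) judge picks a winner
after reading the full transcript. The debater and judge callables
are hypothetical placeholders, not a real library API."""

from typing import Callable, List, Tuple

# (question, stance, transcript so far) -> next argument
Debater = Callable[[str, str, List[str]], str]
# (question, full transcript) -> index of winning debater (0 or 1)
Judge = Callable[[str, List[str]], int]


def run_debate(
    question: str,
    debaters: Tuple[Debater, Debater],
    judge: Judge,
    rounds: int = 3,
) -> int:
    """Alternate arguments between two opposed debaters, then judge."""
    transcript: List[str] = []
    stances = ("pro", "con")
    for _ in range(rounds):
        for i, debater in enumerate(debaters):
            argument = debater(question, stances[i], transcript)
            transcript.append(f"[{stances[i]}] {argument}")
    # The judge sees only the transcript: the human need not verify
    # superhuman claims directly, because each debater is incentivized
    # to expose flaws in the other's arguments.
    return judge(question, transcript)
```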

Related Pages

Risks

Reward Hacking, Scheming, Sycophancy

Approaches

Weak-to-Strong Generalization, Process Supervision, Reward Modeling, Eliciting Latent Knowledge (ELK), AI Alignment, Cooperative IRL (CIRL)

Key Debates

AI Accident Risk Cruxes, AI Alignment Research Agendas, Technical AI Safety Research

Other

Mechanistic Interpretability, Paul Christiano, Jan Leike, Ilya Sutskever

Concepts

Alignment Theoretical Overview

Organizations

Alignment Research Center, Google DeepMind

Historical

Mainstream Era

Tags

scalable-oversight, adversarial-methods, superhuman-alignment, alignment-theory, human-judgment