AI Safety via Debate
AI Safety via Debate is a proposed alignment technique in which two AI systems argue opposing positions and a human judge decides which side argued more convincingly. It is designed to scale human oversight to systems with superhuman capabilities, and specifically to address the scalability limitations of RLHF. Although theoretically promising, it remains experimental, with limited empirical validation.
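The basic structure of a debate can be sketched as a simple loop. This is a minimal illustration only: the names `run_debate`, `pro`, `con`, and `longest_argument_judge` are hypothetical, the debaters are hard-coded stand-ins rather than real models, and the toy judge merely stands in for a human evaluator. It is not drawn from any published implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration; not taken from any real library.
Transcript = List[Tuple[str, str]]          # (speaker, argument) pairs
Debater = Callable[[str, Transcript], str]  # sees the question and the transcript so far
Judge = Callable[[str, Transcript], str]    # returns the winning speaker, "A" or "B"


def run_debate(question: str, debater_a: Debater, debater_b: Debater,
               judge: Judge, rounds: int = 2) -> Tuple[str, Transcript]:
    """Alternate arguments between two debaters, then ask the judge for a verdict."""
    transcript: Transcript = []
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, transcript)))
        transcript.append(("B", debater_b(question, transcript)))
    return judge(question, transcript), transcript


# Toy stand-ins: a real setup would call AI models and a human judge here.
def pro(question: str, transcript: Transcript) -> str:
    return f"Yes: point {len(transcript) // 2 + 1}"


def con(question: str, transcript: Transcript) -> str:
    return f"No: rebuttal {len(transcript) // 2 + 1}"


def longest_argument_judge(question: str, transcript: Transcript) -> str:
    # Placeholder scoring rule (total characters argued); a human judge
    # would instead assess which side was more truthful and convincing.
    totals = {"A": 0, "B": 0}
    for speaker, argument in transcript:
        totals[speaker] += len(argument)
    return max(totals, key=totals.get)


winner, transcript = run_debate("Is the claim true?", pro, con, longest_argument_judge)
```

The key design point is that the judge only has to compare the two sides' arguments rather than solve the underlying question directly, which is what the approach relies on to scale oversight.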
Related
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
Deceptive Alignment
Risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging from 5% to 90% and g...
OpenAI
Leading AI lab that developed GPT models and ChatGPT, analyzing organizational evolution from non-profit research to commercial AGI development ami...
RLHF
RLHF and Constitutional AI are the dominant techniques for aligning language models with human preferences.
Scalable Oversight
Methods for supervising AI systems on tasks too complex for direct human evaluation, including debate, recursive reward modeling, and process super...