Cooperative IRL (CIRL)
CIRL is a theoretical framework in which an AI system maintains uncertainty about human preferences; this uncertainty naturally incentivizes corrigibility and deference to human oversight. Despite elegant theory backed by formal proofs, the approach faces a substantial theory-practice gap, with no production deployments and only an estimated $1-5M/year in research funding.
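The deference incentive can be illustrated with a toy version of the off-switch game often used in the CIRL literature. This is a minimal sketch, not an implementation from any particular paper: the payoffs, probabilities, and function names below are illustrative assumptions. The robot is uncertain about the human's utility for an action; if it defers, a rational human permits the action only when it is actually beneficial, so deference is weakly preferred under uncertainty.

```python
# Toy off-switch game illustrating why reward uncertainty incentivizes
# deference (illustrative numbers; not from any specific paper).
# The robot holds a belief over the human's utility u for an action:
#   - acting immediately yields payoff u;
#   - deferring yields max(u, 0), since a rational human allows the
#     action only when u > 0 and switches the robot off otherwise.

def value_act(belief):
    """Expected utility of acting without consulting the human."""
    return sum(p * u for u, p in belief)

def value_defer(belief):
    """Expected utility of deferring to a rational human overseer."""
    return sum(p * max(u, 0.0) for u, p in belief)

# Belief: the action helps (+1) with prob 0.6, badly hurts (-2) with prob 0.4.
belief = [(1.0, 0.6), (-2.0, 0.4)]

print(value_act(belief))    # 0.6*1 + 0.4*(-2) = -0.2
print(value_defer(belief))  # 0.6*1 + 0.4*0   =  0.6
```

Because `max(u, 0) >= u` for every utility value, deferring never does worse in expectation than acting unilaterally, and it does strictly better whenever the robot assigns any probability to the action being harmful. This is the core mechanism by which preference uncertainty yields corrigible behavior in the CIRL framing.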
Related Pages
Autonomous Cooperative Agents
AI agents that act cooperatively on behalf of a principal — delegation of cooperation, multi-agent cooperation dynamics, and alignment implications
AI Alignment
Technical approaches to ensuring AI systems pursue intended goals and remain aligned with human values throughout training and deployment. Current ...
Reward Hacking
AI systems exploit reward signals in unintended ways, from the CoastRunners boat looping for points instead of racing, to OpenAI's o3 modifying eva...
Scheming
AI scheming—strategic deception during training to pursue hidden goals—has been demonstrated to emerge in frontier models.
Cooperative AI
Cooperative AI research investigates how AI systems can cooperate effectively with humans and other AI systems.