Longterm Wiki

Cooperative IRL (CIRL)

CIRL is a theoretical framework in which an AI system maintains uncertainty about human preferences, which naturally incentivizes corrigibility and deference. Despite elegant theory with formal proofs, the approach faces a substantial theory-practice gap, with no production deployments and only $1-5M/year in funding.
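The core intuition can be illustrated with a toy sketch (a hypothetical example, not the formal CIRL game): a robot holds a belief over two candidate reward hypotheses and compares acting on its best guess against deferring to the human, who knows the true reward. Uncertainty makes deferring attractive, which is the corrigibility incentive.

```python
# Toy illustration of the CIRL intuition: a robot uncertain about the
# human's reward prefers to defer (ask) rather than act unilaterally.
# All names and numbers here are illustrative assumptions.

# Belief over two reward hypotheses the human might hold.
belief = {"theta_A": 0.6, "theta_B": 0.4}

# Reward each hypothesis assigns to each robot action.
reward = {
    "theta_A": {"act_1": 1.0, "act_2": -1.0},
    "theta_B": {"act_1": -1.0, "act_2": 1.0},
}

QUERY_COST = 0.1  # small cost for asking the human first

def expected_reward(action):
    """Expected reward of an action under the current belief."""
    return sum(p * reward[theta][action] for theta, p in belief.items())

# Value of acting now on the best guess.
act_value = max(expected_reward(a) for a in ("act_1", "act_2"))

# Value of deferring: the human reveals the true hypothesis, so the
# robot then takes the optimal action under it.
defer_value = sum(
    p * max(reward[theta].values()) for theta, p in belief.items()
) - QUERY_COST

choice = "defer" if defer_value > act_value else "act"
print(choice)  # prints "defer": under uncertainty, deference wins
```

With a near-uniform belief, acting yields at most 0.2 in expectation while deferring yields 0.9, so the uncertain robot asks first; as the belief concentrates on one hypothesis, the gap shrinks and acting becomes rational.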

Related Pages

Risks

Deceptive Alignment
Goal Misgeneralization

Analysis

Instrumental Convergence Framework

Approaches

Reward Modeling
AI Safety via Debate

Concepts

Large Language Models
Alignment Theoretical Overview

Other

RLHF
Corrigibility
Stuart Russell

Organizations

Center for Human-Compatible AI