Cooperative IRL (CIRL)

Summary
CIRL is a theoretical framework where AI systems maintain uncertainty about human preferences, which naturally incentivizes corrigibility and deference. Despite elegant theory with formal proofs, the approach faces a substantial theory-practice gap, with no production deployments and only $1-5M/year in academic investment, making it more influential for conceptual foundations than for immediate intervention design.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Tractability | Medium | Requires bridging the theory-practice gap for neural networks; needs fundamental advances in deep learning integration |
| Key Proponents | UC Berkeley CHAI | Stuart Russell, Anca Dragan, Dylan Hadfield-Menell |
| Annual Investment | $1-5M/year | Primarily academic grants |
Overview
Cooperative Inverse Reinforcement Learning (CIRL), also known as Cooperative IRL or Assistance Games, is a theoretical framework developed at UC Berkeley's Center for Human-Compatible AI (CHAI) that reconceptualizes the AI alignment problem as a cooperative game between humans and AI systems. Unlike standard reinforcement learning, where agents optimize a fixed reward function, CIRL agents maintain uncertainty about human preferences and learn these preferences through interaction while cooperating with humans to maximize expected value under this uncertainty.
The key insight is that an AI system uncertain about what humans want has an incentive to remain corrigible: to allow itself to be corrected, to seek clarification, and to avoid actions with irreversible consequences. If the AI might be wrong about human values, acting cautiously and deferring to human judgment becomes instrumentally valuable rather than requiring explicit constraints. This addresses the corrigibility problem at a deeper level than approaches that try to add constraints on top of a capable optimizer.
CIRL represents some of the most rigorous theoretical work in AI alignment, with formal proofs about agent behavior under various assumptions. However, it faces significant challenges in practical application: the framework assumes access to human reward functions in a way that doesn't translate directly to training large language models, and the gap between CIRL's elegant theory and the messy reality of deep learning remains substantial. Current investment ($1-5M/year) remains primarily academic, though the theoretical foundations influence broader thinking about alignment. Recent work on AssistanceZero (Laidlaw et al., 2025) demonstrates the first scalable approach to solving assistance games, suggesting the theory-practice gap may be narrowing.
How It Works
The CIRL framework reconceptualizes AI alignment as a two-player cooperative game. Unlike standard inverse reinforcement learning where the robot passively observes a human assumed to act optimally, CIRL models both agents as actively cooperating. The human knows their preferences but the robot does not; crucially, both agents share the same reward function (the human's). This shared objective creates natural incentives for the human to teach and the robot to learn without explicitly programming these behaviors.
The robot maintains a probability distribution over possible human preferences and takes actions that maximize expected reward under this uncertainty. When the robot is uncertain, it has instrumental reasons to: (1) seek clarification from the human, (2) avoid irreversible actions, and (3) accept being shut down if the human initiates shutdown. This is the key insight: corrigibility emerges from uncertainty rather than being imposed as a constraint.
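A minimal sketch of this loop, assuming a discrete set of candidate preference parameters and a noisily-rational human (an illustrative Python simplification, not CHAI's implementation; all names and the toy reward are hypothetical):

```python
import numpy as np

# Candidate preference parameters theta: known to the human, not the robot.
thetas = np.array([-1.0, 0.0, 1.0])
belief = np.ones(len(thetas)) / len(thetas)   # robot's uniform prior over theta
actions = np.array([-1.0, 0.0, 1.0])          # toy shared action space

def reward(action, theta):
    # Shared reward: best when the action matches the human's preference.
    return -(action - theta) ** 2

def update_belief(belief, human_action, beta=2.0):
    # Bayes update under a Boltzmann (noisily-rational) human model:
    # P(a_H | theta) proportional to exp(beta * R(a_H; theta)).
    likelihood = np.exp(beta * np.array([reward(human_action, t) for t in thetas]))
    posterior = belief * likelihood
    return posterior / posterior.sum()

def robot_action(belief):
    # Maximize expected reward under the current belief about theta.
    expected = [np.dot(belief, [reward(a, t) for t in thetas]) for a in actions]
    return actions[int(np.argmax(expected))]

belief = update_belief(belief, human_action=1.0)  # observe one human action
print(belief, robot_action(belief))  # belief shifts toward theta = 1; robot follows
```

A full CIRL solution would also model the human choosing actions pedagogically in anticipation of this inference; the sketch covers only the robot's side of the game.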
Risks Addressed
| Risk | Relevance | How CIRL Helps |
|---|---|---|
| Goal Misgeneralization | High | Maintains uncertainty rather than locking onto inferred goals |
| Corrigibility Failures | High | Uncertainty creates instrumental incentive to accept correction |
| Reward Hacking | Medium | Human remains in loop to refine reward signal |
| Deceptive Alignment | Medium | Information-seeking behavior conflicts with deception incentives |
| Scheming | Low-Medium | Deference to humans limits autonomous scheming |
Risk Assessment & Impact
| Risk Category | Assessment | Key Metrics | Evidence Source |
|---|---|---|---|
| Safety Uplift | Medium | Encourages corrigibility through uncertainty | Theoretical analysis |
| Capability Uplift | Neutral | Not primarily a capability technique | By design |
| Net World Safety | Helpful | Good theoretical foundations | CHAI research |
| Lab Incentive | Weak | Mostly academic; limited commercial pull | Structural |
The Cooperative Game Setup
CIRL formulates the AI alignment problem as a two-player cooperative game:
| Player | Role | Knowledge | Objective |
|---|---|---|---|
| Human (H) | Acts, provides information | Knows own preferences (θ) | Maximize expected reward |
| Robot (R) | Acts, learns preferences | Uncertain about θ | Maximize expected reward given uncertainty about θ |
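Formally, Hadfield-Menell et al. (2016) specify this setup as a two-player Markov game with identical payoffs in which only the human observes the reward parameter; the notation below follows the structure of that definition, though symbols vary across presentations:

```latex
M = \big\langle S,\; \{A^H, A^R\},\; T(s' \mid s, a^H, a^R),\; \{\Theta,\; R(s, a^H, a^R; \theta)\},\; P_0(s_0, \theta),\; \gamma \big\rangle
```

Both players maximize the same discounted return $\mathbb{E}\big[\sum_t \gamma^t R(s_t, a_t^H, a_t^R; \theta)\big]$; since the robot never observes $\theta$ directly, its optimal policy depends on a belief $b(\theta)$ inferred from the human's behavior.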
Key Mathematical Properties
| Property | Description | Safety Implication |
|---|---|---|
| Uncertainty Maintenance | Robot maintains distribution over human values | Avoids overconfident wrong actions |
| Value of Information | Robot values learning about preferences | Seeks clarification naturally |
| Corrigibility | Emerges from uncertainty, not constraints | More robust than imposed rules |
| Preference Inference | Robot learns from human actions | Human can teach through behavior |
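The value-of-information row can be made concrete with a toy calculation (hypothetical numbers, not from the CIRL papers): querying the human is worth the gap between the best expected reward after seeing the answer and the best expected reward acting on the prior alone.

```python
import numpy as np

# rewards[theta][action]: shared reward when the true preference is theta.
rewards = np.array([[1.0, -2.0],    # theta = 0 prefers action 0
                    [-2.0, 1.0]])   # theta = 1 prefers action 1
prior = np.array([0.5, 0.5])        # robot's belief over theta

# Act now: best expected reward under the prior, max_a E[R(a; theta)].
act_now = max(prior @ rewards[:, a] for a in range(2))

# Ask first: assume the human's answer reveals theta exactly.
ask_first = sum(prior[t] * rewards[t].max() for t in range(2))

print(act_now, ask_first, ask_first - act_now)  # -0.5, 1.0, VOI = 1.5
```

Whenever this difference is positive and querying is cheap, the expected-reward maximizer asks before acting, which is exactly the deference the table describes.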
Why Uncertainty Encourages Corrigibility
In the CIRL framework, an uncertain agent has several beneficial properties:
| Behavior | Mechanism | Benefit |
|---|---|---|
| Accepts Correction | Might be wrong, so human correction is valuable information | Natural shutdown acceptance |
| Avoids Irreversibility | High-impact actions might be in the wrong direction | Conservative action selection |
| Seeks Clarification | Information about preferences is valuable | Active value learning |
| Defers to Humans | Human actions are signals about preferences | Human judgment incorporated |
Theoretical Foundations
Comparison to Standard RL
| Aspect | Standard RL | CIRL |
|---|---|---|
| Reward Function | Known and fixed | Unknown, to be learned |
| Agent's Goal | Maximize known reward | Maximize expected reward under uncertainty |
| Human's Role | Provides reward signal | Active player with own actions |
| Correction | Orthogonal to optimization | Integral to optimization |
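Written side by side, the two objectives differ only in where the reward function sits (standard notation; the CIRL form marginalizes over the robot's belief $b$ about $\theta$):

```latex
\text{Standard RL:}\quad \max_{\pi}\; \mathbb{E}\Big[\textstyle\sum_t \gamma^t R(s_t, a_t)\Big]
\qquad\qquad
\text{CIRL:}\quad \max_{\pi^R}\; \mathbb{E}_{\theta \sim b}\Big[\textstyle\sum_t \gamma^t R(s_t, a_t^H, a_t^R; \theta)\Big]
```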
Key Theorems and Results
| Result | Description | Significance |
|---|---|---|
| Value Alignment Theorem | Under certain conditions, CIRL agent learns human preferences | Provides formal alignment guarantee |
| Corrigibility Emergence | Uncertain agent prefers shutdown over wrong action | Corrigibility without hardcoding |
| Information Value | Positive value of information about preferences | Explains deference behavior |
| Off-Switch Game | Traditional agents disable off-switches; CIRL agents accept shutdown | Shows shutdown acceptance follows from preference uncertainty |
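The off-switch result has a compact numerical illustration (a simplified version of the game in Hadfield-Menell et al. 2017, with hypothetical numbers): the robot can either commit to an action of uncertain utility U, or defer to a human who vetoes exactly when U < 0. Deferring earns E[max(U, 0)], which can never be worse than committing on max(E[U], 0).

```python
import numpy as np

rng = np.random.default_rng(0)

# Robot's belief about the utility U of its proposed action:
# slightly positive on average, but plausibly harmful.
samples = rng.normal(loc=0.2, scale=1.0, size=100_000)

act_directly = max(samples.mean(), 0.0)   # commit now: max(E[U], 0), about 0.2
defer = np.maximum(samples, 0.0).mean()   # human vetoes U < 0: E[max(U, 0)], about 0.5

print(act_directly, defer)  # deference strictly wins under this belief
```

The gain is strict whenever the belief places mass on both signs of U and vanishes as the robot becomes certain, matching the claim that corrigibility here is a consequence of uncertainty rather than a hardcoded rule.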
Theoretical Scalability
CIRL's theoretical properties scale well in principle:
| Factor | Scalability | Notes |
|---|---|---|
| Uncertainty Representation | Scales with compute | Can represent complex beliefs |
| Corrigibility Incentive | Maintained at scale | Built into the objective |
| Preference Learning | Improves with interaction | More data helps |
Practical Scalability
The challenges are in implementation:
| Challenge | Description | Status |
|---|---|---|
| Deep Learning Integration | How to maintain uncertainty in neural networks | Open problem |
| Reward Function Complexity | Human values are complex | Difficult to represent |
| Interaction Requirements | Requires active human interaction | Expensive |
| Approximation Errors | Real implementations approximate | May lose guarantees |
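One commonly discussed (and partial) answer to the first challenge is approximating a belief over reward functions with an ensemble of learned reward models; the sketch below is illustrative only, not an established CIRL implementation, and every name in it is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for an ensemble of reward models trained on bootstrapped
# preference data; here, random linear models over a feature vector.
n_models, n_features = 8, 4
ensemble = rng.normal(size=(n_models, n_features))

def reward_stats(features):
    # Ensemble mean approximates E[R]; disagreement proxies epistemic uncertainty.
    preds = ensemble @ features
    return preds.mean(), preds.std()

def cautious_action(candidate_actions, risk_penalty=1.0):
    # Crude stand-in for uncertainty-sensitive planning: prefer high expected
    # reward, penalize actions the ensemble disagrees about.
    def score(features):
        mean, std = reward_stats(features)
        return mean - risk_penalty * std
    return max(candidate_actions, key=score)

candidates = [rng.normal(size=n_features) for _ in range(5)]
print(cautious_action(candidates))
```

Ensemble disagreement only loosely tracks the posterior a true CIRL agent would maintain, which is part of why the table marks deep learning integration as an open problem.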
Current Research & Investment
| Metric | Value | Notes |
|---|---|---|
| Annual Investment | $1-5M/year | Primarily academic |
| Adoption Level | None (academic) | No production deployment |
| Primary Research | UC Berkeley CHAI | Stuart Russell's group |
| Recommendation | Increase | Good foundations; needs practical work |
Research Directions
| Direction | Status | Potential Impact |
|---|---|---|
| Scalable Assistance Games | Active (2025) | AssistanceZero demonstrates tractability in complex environments |

Relationship to Other Approaches

- RLHF: CIRL provides the theoretical foundation; RLHF is a practical approximation
- Reward Modeling: CIRL explains why learned rewards should include uncertainty
- Corrigibility: emerges from CIRL's uncertainty rather than from imposed constraints
Risks

Deceptive Alignment · Goal Misgeneralization

Analysis

Instrumental Convergence Framework

Approaches

Reward Modeling · AI Safety via Debate

Concepts

Large Language Models · RLHF · Alignment Theoretical Overview

Organizations

Center for Human-Compatible AI

Other

Stuart Russell