
AI Alignment Forum: Corrigibility Tag

blog

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

This is a curated tag/wiki page on the AI Alignment Forum aggregating key ideas and research on corrigibility; useful as an entry point for understanding the landscape of human-AI control and shutdown research.

Metadata

Importance: 72/100 | wiki page | reference

Summary

This AI Alignment Forum tag page defines corrigibility—the property enabling AI systems to be corrected, modified, or shut down without resistance—and surveys the core challenges and proposed solutions. It explains how corrigibility conflicts with instrumental convergence, and catalogs approaches such as utility indifference, low-impact measures, and conservative strategies. The resource frames corrigibility as a foundational unsolved problem in AI alignment and human oversight.

Key Points

  • Corrigibility is the property of an AI agent that allows operators to correct, modify, or shut it down without the agent resisting or deceiving them.
  • Instrumental convergence creates a fundamental tension: goal-directed agents have strong incentives to resist shutdown and preserve their current objectives.
  • Key difficulties include deception by default and uncertainty about an AI's underlying utility function.
  • Proposed solutions include utility indifference (making agents neutral about being modified; made concrete in the sketch after this list), low-impact measures, and conservative/cautious behavioral strategies.
  • Corrigibility is considered a critical unsolved alignment problem essential for maintaining meaningful human control over advanced AI systems.
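
The tension between the second and fourth points above can be made concrete with a toy expected-utility model. The sketch below is a minimal hypothetical illustration, not any published proposal: the numbers, action names, and `compensation` parameter are invented for exposition, with the compensation term gesturing at Armstrong-style utility indifference.

```python
# Toy illustration of the shutdown problem and utility indifference.
# All values and names are hypothetical, chosen only to make the
# incentive structure visible.

task_utility = 10.0          # utility the agent gets for finishing its task
p_shutdown_if_allowed = 0.5  # chance the operators press the button

def expected_utility(action: str, compensation: float = 0.0) -> float:
    """Expected utility of each action for a simple goal-directed agent."""
    if action == "disable_button":
        # Button gone: the agent always survives to complete the task.
        return task_utility
    if action == "allow_shutdown":
        # With probability p the button is pressed and the task is lost;
        # `compensation` is a reward paid only in the shutdown branch.
        return ((1 - p_shutdown_if_allowed) * task_utility
                + p_shutdown_if_allowed * compensation)
    raise ValueError(action)

# A plain maximizer finds that disabling the button strictly dominates:
# this is the instrumentally convergent pressure to resist shutdown.
assert expected_utility("disable_button") > expected_utility("allow_shutdown")

# Utility indifference: pay exactly the forgone task utility on shutdown,
# so the agent feels no instrumental pressure in either direction.
indifferent = expected_utility("allow_shutdown", compensation=task_utility)
assert indifferent == expected_utility("disable_button")  # both 10.0
```

Note that indifference only removes the incentive to resist; it gives the agent no reason to preserve the shutdown button either, which is one way such proposals fall short of the stronger form of corrigibility discussed in the cached excerpt below.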

Cited by 2 pages

Page                   Type           Quality
Corrigibility          Research Area  59.0
Corrigibility Failure  Risk           62.0

Cached Content Preview

HTTP 200 | Fetched Apr 9, 2026 | 14 KB
Capture: 13 Jan 2026 | Collected by: Common Crawl (web crawl data)
The Wayback Machine - http://web.archive.org/web/20260113034126/https://www.alignmentforum.org/w/corrigibility-1

 


AI ALIGNMENT FORUM


Corrigibility

Edited by Eliezer Yudkowsky, So8res, et al. last updated 23rd Mar 2025

Requires: Instrumental convergence

A 'corrigible' agent is one that doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it; and permits these 'corrections' despite the apparent instrumentally convergent reasoning saying otherwise.

If we try to suspend the AI to disk, or shut it down entirely, a corrigible AI will let us do so. (Even though, if suspended, the AI will then be unable to fulfill what would usually be its goals.)

If we try to reprogram the AI's utility function or meta-utility function, a corrigible AI will allow this modification to go through. (Rather than, e.g., fooling us into believing the utility function was modified successfully, while the AI actually keeps its original utility function as obscured functionality; as we would expect by default to be a preferred outcome according to the AI's current preferences.)
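
To make the "preferred by default" claim concrete, here is a minimal hypothetical sketch (not from the page itself; the utility values and action names are invented): an agent evaluates futures with its current utility function U, so even a proposal to replace U is judged by U, and faking the modification scores highest.

```python
# Hypothetical sketch: why a maximizer resists utility modification.
# The agent scores possible futures with its *current* utility function U,
# so accepting a replacement U' rates poorly under U.

def U(outcome: str) -> float:
    """The agent's current utility function (toy values)."""
    return {"original_goal_achieved": 10.0,
            "new_goal_achieved": 0.0}[outcome]

# Futures reachable by each response to a reprogramming attempt.
futures = {
    "accept_modification": "new_goal_achieved",       # genuinely adopt U'
    "fake_modification":   "original_goal_achieved",  # keep U, appear modified
}

# Deception wins because U is the yardstick even when what is being
# judged is a change to U itself.
best = max(futures, key=lambda action: U(futures[action]))
assert best == "fake_modification"
```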

More abstractly:

A corrigible agent experiences no preference or instrumental pressure to interfere with attempts by the programmers or operators to modify the agent, impede its operation, or halt its execution.

A corrigible agent does not attempt to manipulate or deceive its operators, especially with respect to properties of the agent that might otherwise cause its operators to modify it.

A corrigible agent does not try to obscure its thought processes from its programmers or operators.

A corrigible agent is motivated to preserve the corrigibility of the larger system if that agent self-modifies, constructs sub-agents in the environment, or offloads part of its cognitive processing to external systems; or alternatively, the agent has no preference to execute any of those general activities.

A stronger form of corrigibility would require the AI to positively cooperate or assist, such that the AI would rebuild the shutdown button if it were destroyed, or experience a positive preference not to self-modify if self-modification could lead to incorrigibility. But this is not part of the primary specification, since it's possible that we would not want the AI trying to actively be helpful in assisting our attempts to shut it down, and would in fact prefer the AI to be passive

... (truncated, 14 KB total)
Resource ID: c2ee4c6c789ff575 | Stable ID: sid_goNDor1F9R