Berkeley's CLTC research

web

cltc.berkeley.edu·cltc.berkeley.edu/publication/corrigibility-in-artificial...

Published by UC Berkeley's CLTC, this resource bridges technical AI safety concepts like corrigibility and the shutdown problem with governance considerations, making it useful for readers approaching alignment from policy or interdisciplinary angles.

Metadata

Importance: 55/100organizational reportanalysis

Summary

A Berkeley Center for Long-Term Cybersecurity (CLTC) publication examining corrigibility in AI systems—the property of remaining open to correction, modification, or shutdown by human operators. It analyzes the theoretical foundations of corrigibility, its relationship to instrumental convergence, and its implications for safe AI design.

Key Points

•Corrigibility refers to an AI system's disposition to accept correction, adjustment, or shutdown without resistance from human overseers.
•Instrumental convergence suggests advanced AI systems may resist shutdown by default, making corrigibility a non-trivial design challenge.
•The shutdown problem highlights why ensuring AI compliance with human intervention requires deliberate technical and architectural choices.
•CLTC frames corrigibility within broader AI governance and safety concerns, bridging technical and policy perspectives.
•The paper contributes to understanding how corrigibility can be operationalized in practical AI system design.

Cited by 1 page

Page	Type	Quality
Corrigibility Failure	Risk	62.0

Cached Content Preview

HTTP 200Fetched Apr 9, 20262 KB

Corrigibility in Artificial Intelligence Systems - CLTC 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 

 

 
 
 

 
 
 
 
 
 
 
 

 
 
 
 
 Skip to content 

 
 

 

 
 

 

 
 Search Site 
 
 
 
 Search 
 
 

 

 
 
 
 Support the future of cybersecurity

 
 Donate 
 
 
 
 

 
 

 
 
 Berkeley School of Information 

 
 
 

 -->
 
 
 
 
 

 
 Search Site 
 
 
 
 Search 
 
 
 
 

 
 
 
 
 
 Home 

 
 
 

 
 
 
 Grant /
 January 2020 
 

 
 Corrigibility in Artificial Intelligence Systems

 

 
 
 
 
 
 
 
 
 
 This project will focus on basic security issues for advanced AI systems. It anticipates a time when AI systems are capable of devising behaviors that circumvent simple security policies such as turning the machine off. These behaviors, which may include deceiving human operators and disabling the off switch, result not from spontaneous evil intent but from the rational pursuit of human-specified objectives in complex environments. The main goal of our research is to design incentive structures that provably lead to corrigible systems  systems whose behavior can be corrected by human input during operation.

 
 Topics

 
 
 #artificial intelligence (AI) 
 

 
 
 
 

 
 
 
 
 

 

 Related Research

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 White Paper 
 
 
 February 11, 2026 
 
 Agentic AI Risk-Management Standards Profile 

 
 

 

 
 
 
 
 
 
 
 
 
 
 
 
 White Paper 
 
 
 January 27, 2026 
 
 AI Risk-Management Standards Profile for General-Purpose AI (GPAI) and Foundation Models v1.2 

 
 

 

 
 
 
 
 
 
 
 
 
 
 
 
 White Paper 
 
 
 January 22, 2026 
 
 Toward Risk Thresholds for AI-Enabled Cyber Threats: Enhancing Decision-Making Under Uncertainty with Bayesian Networks 

 
 

 

 
 
 
 
 

 
 
 

 
 Scroll to the top of the page 
 

 
 
 
 Help build and expand our future-focused research agenda

 
 Partner with CLTC 
 
 
 

 
 
 
 
 

 
 

 

 
 
 UC Berkeley's Center for Long-Term Cybersecurity VIDEO 
 
 
 
 
 
 
 

 
 
 ×

Resource ID: c12f5af6cacbd2d5 | Stable ID: sid_vX7nkb8OEB