Back
Berkeley's CLTC research
webcltc.berkeley.edu·cltc.berkeley.edu/publication/corrigibility-in-artificial...
Published by UC Berkeley's CLTC, this resource bridges technical AI safety concepts like corrigibility and the shutdown problem with governance considerations, making it useful for readers approaching alignment from policy or interdisciplinary angles.
Metadata
Importance: 55/100organizational reportanalysis
Summary
A Berkeley Center for Long-Term Cybersecurity (CLTC) publication examining corrigibility in AI systems—the property of remaining open to correction, modification, or shutdown by human operators. It analyzes the theoretical foundations of corrigibility, its relationship to instrumental convergence, and its implications for safe AI design.
Key Points
- •Corrigibility refers to an AI system's disposition to accept correction, adjustment, or shutdown without resistance from human overseers.
- •Instrumental convergence suggests advanced AI systems may resist shutdown by default, making corrigibility a non-trivial design challenge.
- •The shutdown problem highlights why ensuring AI compliance with human intervention requires deliberate technical and architectural choices.
- •CLTC frames corrigibility within broader AI governance and safety concerns, bridging technical and policy perspectives.
- •The paper contributes to understanding how corrigibility can be operationalized in practical AI system design.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Corrigibility Failure | Risk | 62.0 |
Cached Content Preview
HTTP 200Fetched Apr 9, 20262 KB
Corrigibility in Artificial Intelligence Systems - CLTC Skip to content Search Site Search Support the future of cybersecurity Donate Berkeley School of Information --> Search Site Search Home Grant / January 2020 Corrigibility in Artificial Intelligence Systems This project will focus on basic security issues for advanced AI systems. It anticipates a time when AI systems are capable of devising behaviors that circumvent simple security policies such as turning the machine off. These behaviors, which may include deceiving human operators and disabling the off switch, result not from spontaneous evil intent but from the rational pursuit of human-specified objectives in complex environments. The main goal of our research is to design incentive structures that provably lead to corrigible systems systems whose behavior can be corrected by human input during operation. Topics #artificial intelligence (AI) Related Research White Paper February 11, 2026 Agentic AI Risk-Management Standards Profile White Paper January 27, 2026 AI Risk-Management Standards Profile for General-Purpose AI (GPAI) and Foundation Models v1.2 White Paper January 22, 2026 Toward Risk Thresholds for AI-Enabled Cyber Threats: Enhancing Decision-Making Under Uncertainty with Bayesian Networks Scroll to the top of the page Help build and expand our future-focused research agenda Partner with CLTC UC Berkeley's Center for Long-Term Cybersecurity VIDEO ×
Resource ID:
c12f5af6cacbd2d5 | Stable ID: sid_vX7nkb8OEB