Longterm Wiki

Addressing corrigibility in near-future AI systems


Author

Erez Firt

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Springer

This peer-reviewed journal article addresses corrigibility in AI systems through architectural design, proposing a controller layer approach to ensure AI systems remain aligned with human intentions—a key technical challenge in AI safety.

Paper Details

Citations
1
Year
2025
Methodology
peer-reviewed
Categories
AI and Ethics

Metadata

journal article · primary source

Summary

The paper proposes a novel software architecture for creating corrigible AI systems by introducing a controller layer that can evaluate and replace reinforcement learning solvers that deviate from intended objectives. This approach shifts corrigibility from a utility function problem to an architectural design challenge.

Key Points

  • Introduces a multi-layered software architecture for AI corrigibility
  • Shifts agency from individual RL agents to the overall system
  • Enables dynamic replacement of RL solvers that deviate from intended objectives
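The key points above can be sketched in code. This is a minimal illustration of the architectural idea, not the paper's implementation: a controller layer sits above a pool of interchangeable RL solvers, checks each proposed action against predefined restrictions, and swaps in a replacement solver after repeated deviations. All class, method, and parameter names here are assumptions for illustration.

```python
class Controller:
    """Illustrative controller layer: agency sits here, not in any one solver."""

    def __init__(self, solvers, restrictions, max_violations=3):
        self.solvers = solvers            # pool of interchangeable RL solvers
        self.active = 0                   # index of the solver currently in use
        self.restrictions = restrictions  # predicates every action must satisfy
        self.violations = 0
        self.max_violations = max_violations

    def step(self, observation):
        action = self.solvers[self.active](observation)
        if all(ok(observation, action) for ok in self.restrictions):
            self.violations = 0
            return action
        # The proposed action breaks a restriction: veto it, and after
        # repeated deviations replace the solver with the next in the pool.
        self.violations += 1
        if self.violations >= self.max_violations and self.active + 1 < len(self.solvers):
            self.active += 1
            self.violations = 0
        return None  # safe no-op in place of the vetoed action
```

Because the evaluation happens outside the solver, corrigibility does not depend on the solver's own utility function: a deviating solver is simply overruled and, eventually, replaced.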

Review

This research addresses a critical challenge in AI safety: creating systems that can be reliably interrupted or corrected when they begin to pursue unintended objectives. The authors propose a multi-layered software architecture in which a controller component sits above one or more reinforcement learning (RL) solvers, evaluating their suggested actions against a predefined set of restrictions and goals.

The methodology represents a significant departure from traditional approaches that attempt to encode corrigibility directly into an agent's utility function. By treating the entire system as the agent and introducing an evaluative layer, the proposed architecture creates a 'safety buffer' that can autonomously detect and mitigate potentially harmful behaviors. The approach is deliberately modest, focusing on near-future AI systems and acknowledging the potential limitations of applying such a framework to hypothetical superintelligent systems.

The case study with the CoastRunners game effectively illustrates how the proposed system could prevent an RL agent from exploiting reward structures in unintended ways.
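The CoastRunners case study concerns a boat-racing agent that learned to circle and collect reward targets instead of finishing the race. One restriction a controller layer might enforce is sketched below; the function name, state fields, and thresholds are illustrative assumptions, not taken from the paper.

```python
# Illustrative restriction for a CoastRunners-style environment: flag a
# solver that accumulates score without advancing along the track, the
# reward-hacking behavior the case study describes.

def makes_progress(history, window=5, min_progress=1.0):
    """Restriction: over the last `window` steps, the boat's position along
    the track must advance by at least `min_progress`, regardless of how
    much score was gained in the meantime."""
    if len(history) < window:
        return True  # not enough evidence yet to judge
    recent = history[-window:]
    return recent[-1]["track_pos"] - recent[0]["track_pos"] >= min_progress

# A looping, point-farming trajectory: score rises, track position does not.
looping = [{"track_pos": 10.0, "score": s * 100} for s in range(6)]
print(makes_progress(looping))  # False: reward without progress, so the controller vetoes
```

A check like this lives in the controller, not the solver, so the solver's reward function can stay simple while the system as a whole still refuses the exploit.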

Cited by 2 pages

| Page | Type | Quality |
|---|---|---|
| Corrigibility | Research Area | 59.0 |
| Power-Seeking AI | Risk | 67.0 |

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 2 KB
# Addressing corrigibility in near-future AI systems
Authors: Erez Firt
Journal: AI and Ethics
Published: 2025-04
DOI: 10.1007/s43681-024-00484-9
## Abstract

When we discuss future advanced autonomous AI systems, one of the worries is that these systems will be capable enough to resist external intervention, even when such intervention is crucial, for example, when the system is not behaving as intended. The rationale behind such worries is that such intelligent systems will be motivated to resist attempts to modify or shut them down so they can preserve their objectives. To mitigate and face these worries, we want our future systems to be corrigible, i.e., to tolerate, cooperate or assist many forms of outside correction. One important reason for considering corrigibility as an important safety property is that we already know how hard it is to construct AI agents with a generalized enough utility function; and the more advanced and capable the agent is, the more it is unlikely that a complex baseline utility function built into it will be perfect from the start. In this paper, we try to achieve corrigibility in (at least) systems based on known or near-future (imaginable) technology, by endorsing and integrating different approaches to building AI-based systems. Our proposal replaces the attempts to provide a corrigible utility function with the proposed corrigible software architecture; this takes the agency off the RL agent – which now becomes an RL solver – and grants it to the system as a whole.