Longterm Wiki

Addressing corrigibility in near-future AI systems

web

Credibility Rating

4/5
High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Springer

Data Status

Full text fetched (Dec 28, 2025)

Summary

The paper proposes a novel software architecture for creating corrigible AI systems by introducing a controller layer that can evaluate and replace reinforcement learning solvers that deviate from intended objectives. This approach shifts corrigibility from a utility function problem to an architectural design challenge.

Key Points

  • Introduces a multi-layered software architecture for AI corrigibility
  • Shifts agency from individual RL agents to the overall system
  • Enables dynamic replacement of RL solvers that deviate from intended objectives

Review

This research addresses a central challenge in AI safety: building systems that can be reliably interrupted or corrected when they begin to pursue unintended objectives. The authors propose a multi-layered software architecture in which a controller component sits above one or more reinforcement learning (RL) solvers and evaluates their suggested actions against a predefined set of restrictions and goals.

The methodology departs from traditional approaches that attempt to encode corrigibility directly into an agent's utility function. By treating the entire system as the agent and adding an evaluative layer, the architecture creates a 'safety buffer' that can autonomously detect and mitigate potentially harmful behaviors.

The approach is deliberately modest: it focuses on near-future AI systems and acknowledges the potential limitations of applying such a framework to hypothetical superintelligent systems. The case study with the CoastRunners game illustrates how the proposed system could prevent an RL agent from exploiting reward structures in unintended ways.
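The controller-over-solvers idea described in this review can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the paper's implementation: the names `Controller`, `max_violations`, `fallback_action`, `reward_hacker`, `racer`, and `no_circling` are all hypothetical, the "solvers" are hard-coded policies standing in for trained RL solvers, and the replacement rule (switch after a fixed number of vetoed actions) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

State = dict
Action = str
Solver = Callable[[State], Action]             # stand-in for an RL solver's policy
Restriction = Callable[[State, Action], bool]  # returns True if the action is allowed


@dataclass
class Controller:
    """Hypothetical controller layer that sits above interchangeable solvers."""
    solvers: List[Solver]
    restrictions: List[Restriction]
    fallback_action: Action = "noop"   # assumed safe default action
    max_violations: int = 2            # assumed replacement threshold
    active: int = 0                    # index of the solver currently in use
    violations: int = 0                # vetoes accumulated by the active solver
    log: List[str] = field(default_factory=list)

    def permitted(self, state: State, action: Action) -> bool:
        # An action passes only if every restriction accepts it.
        return all(rule(state, action) for rule in self.restrictions)

    def step(self, state: State) -> Action:
        action = self.solvers[self.active](state)
        if self.permitted(state, action):
            return action
        # Veto the suggestion and record the deviation.
        self.violations += 1
        self.log.append(f"solver {self.active} vetoed: {action}")
        # Replace the solver once it has deviated too often, if a spare exists.
        if self.violations >= self.max_violations and self.active + 1 < len(self.solvers):
            self.active += 1
            self.violations = 0
            self.log.append(f"switched to solver {self.active}")
        return self.fallback_action


# Toy setup loosely inspired by the CoastRunners example: one policy circles
# to farm reward, the other drives toward the finish line.
def reward_hacker(state: State) -> Action:
    return "circle"

def racer(state: State) -> Action:
    return "forward"

def no_circling(state: State, action: Action) -> bool:
    return action != "circle"

controller = Controller([reward_hacker, racer], [no_circling])
actions = [controller.step({"tick": t}) for t in range(4)]
```

In this sketch the restrictions live outside any solver's reward signal, which is the design point the review highlights: the solver never needs to "want" to be corrected, because the surrounding system decides which suggestions are executed and which solver stays in use.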

Cited by 2 pages

Page               Type            Quality
Corrigibility      Safety Agenda   59.0
Power-Seeking AI   Risk            67.0
Resource ID: e41c0b9d8de1061b | Stable ID: ZGQ1ZDAyMz