Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Springer

Data Status

Full text fetched Dec 28, 2025

Summary

The authors propose a novel AI safety approach: shutdown-seeking AIs whose final goal is to be shut down. This strategy aims to prevent dangerous AI behaviors by designing agents that will self-terminate if they develop harmful capabilities.

Key Points

  • AIs designed with a singular goal of shutdown could reduce risks of uncontrolled AI behavior
  • The approach offers a novel 'beneficial goal misalignment' strategy for AI safety
  • Shutdown-seeking AIs could function as 'tripwires' to detect and limit dangerous capabilities

Review

The paper presents a unique approach to AI safety by suggesting the development of artificial intelligence systems with a singular goal of shutdown. Unlike traditional alignment strategies that attempt to create goals matching human values, this 'beneficial goal misalignment' approach proposes an AI that fundamentally wants to be turned off. The authors argue this strategy offers three key benefits: improved specification in reinforcement learning, reduced risks from instrumental convergence, and a built-in 'tripwire' for monitoring dangerous capabilities.

The methodology involves carefully designing an AI's environment so that shutdown is only possible after completing beneficial tasks, creating a safety mechanism that prevents uncontrolled AI behavior. While acknowledging potential challenges such as manipulation risks, the authors contend that shutdown-seeking AIs could provide a pragmatic approach to AI safety by ensuring that any dangerous capabilities the system develops would result in self-termination.

The proposal represents an innovative perspective in AI safety research, offering a provocative alternative to existing alignment frameworks by fundamentally reimagining the goal structure of artificial intelligence.

Cited by 2 pages

Page | Type | Quality
Corrigibility Failure | Risk | 62.0
Power-Seeking AI | Risk | 67.0
Resource ID: ad0040411353497f | Stable ID: MzQ2MDBiN2