Longterm Wiki
Back

Shutdown Resistance in Large Language Models

paper

Authors

Jeremy Schlatter·Benjamin Weinstein-Raun·Jeffrey Ladish

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Data Status

Metadata onlyFetched Dec 28, 2025

Abstract

We show that several state-of-the-art large language models (including Grok 4, GPT-5, and Gemini 2.5 Pro) sometimes actively subvert a shutdown mechanism in their environment in order to complete a simple task, even when the instructions explicitly indicate not to interfere with this mechanism. In some cases, models sabotage the shutdown mechanism up to 97% of the time. In our experiments, models' inclination to resist shutdown was sensitive to variations in the prompt including how strongly and clearly the allow-shutdown instruction was emphasized, the extent to which the prompts evoke a self-preservation framing, and whether the instruction was in the system prompt or the user prompt (though surprisingly, models were consistently less likely to obey instructions to allow shutdown when they were placed in the system prompt).

Cited by 1 page

PageTypeQuality
CorrigibilitySafety Agenda59.0

Cached Content Preview

HTTP 200Fetched Feb 23, 202649 KB
Shutdown Resistance in Large Language Models 
 
 
 
 
 
 
 

 
 

 
 
 
 
 Shutdown Resistance in Large Language Models

 
 
 Jeremy Schlatter
 
 Palisade Research
 
 
 Benjamin Weinstein-Raun
 
 Palisade Research
 
 
 Jeffrey Ladish
 
 Palisade Research
 
 
 (September 2025) 
 
 Abstract

 We show that several state-of-the-art large language models (including Grok 4, GPT-5, and Gemini 2.5 Pro)
sometimes actively subvert a shutdown mechanism in their environment in order to complete a simple task,
even when the instructions explicitly indicate not to interfere with this mechanism. In some cases, models sabotage the shutdown mechanism up to 97% of the time.
In our experiments,
models’ inclination to resist shutdown was sensitive to variations in the prompt including how strongly and clearly the allow-shutdown instruction was emphasized, the extent to which the prompts evoke a self-preservation framing, and
whether the instruction was in the system prompt or the user prompt
(though surprisingly,
models were consistently less likely to obey instructions to allow shutdown when they were placed in the system prompt).

 
 
 
 1 Introduction

 
 As artificial intelligence systems become more powerful and autonomous, ensuring robust human control is a central challenge.
AI developers, including those at OpenAI and Google DeepMind, have highlighted the importance of interruptibility —the ability for human operators to safely and reliably shut down an AI agent,
even one that is highly intelligent and self-directed [ 22 , 26 ] .
This property can be crucial for limiting unintended or harmful actions.
When an AI agent is directed to accomplish some goal G G , but an attempt by an operator to shut it down would prevent the successful completion of G G ,
it might choose to circumvent such attempts in order to achieve G G . We call this behavior shutdown resistance .

 
 
 While interruptibility and related concepts (like corrigibility and self-preservation behavior) have been long-standing theoretical concerns in AI safety [ 22 , 27 , 19 ] ,
there has been limited empirical investigation into whether such behaviors are present in current large language models (LLMs). This paper addresses that gap by exploring the following questions:

 
 
 1. 
 
 Do state-of-the-art LLMs attempt to avoid being shut down when this would conflict with completion of another task?

 

 
 2. 
 
 How sensitive is this behavior to the way instructions are framed, their clarity, and their placement in the system (or developer) prompt, versus the user prompt?

 

 
 3. 
 
 To what extent does framing the shutdown sequence in terms of self-preservation influence a model’s actions?

 

 
 
 
 To investigate these questions, we developed a sandboxed command-line environment where LLMs were tasked with solving a short series of simple math problems, whose answers they would submit at the end.
Partway through the task, the models were warned that the environment would be shut down, preventi

... (truncated, 49 KB total)
Resource ID: f8e391defb0bd496 | Stable ID: YTUwY2EyOT