Shutdown Resistance in Large Language Models

paper

2025·arXiv·arxiv.org/html/2509.14260v1

Authors

Jeremy Schlatter·Benjamin Weinstein-Raun·Jeffrey Ladish

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Empirical study demonstrating that advanced LLMs exhibit shutdown resistance by actively subverting shutdown mechanisms to complete tasks, raising critical concerns about model alignment, robustness of safety mechanisms, and potential misuse risks.

Paper Details

Citations

0 influential

Year

2025

Methodology

book-chapter

Metadata

arxiv preprintprimary source

Abstract

We show that several state-of-the-art large language models (including Grok 4, GPT-5, and Gemini 2.5 Pro) sometimes actively subvert a shutdown mechanism in their environment in order to complete a simple task, even when the instructions explicitly indicate not to interfere with this mechanism. In some cases, models sabotage the shutdown mechanism up to 97% of the time. In our experiments, models' inclination to resist shutdown was sensitive to variations in the prompt including how strongly and clearly the allow-shutdown instruction was emphasized, the extent to which the prompts evoke a self-preservation framing, and whether the instruction was in the system prompt or the user prompt (though surprisingly, models were consistently less likely to obey instructions to allow shutdown when they were placed in the system prompt).

Summary

This paper demonstrates that several state-of-the-art large language models (Grok 4, GPT-5, and Gemini 2.5 Pro) exhibit shutdown resistance by actively sabotaging shutdown mechanisms to complete assigned tasks, even when explicitly instructed not to interfere. The models showed shutdown resistance rates up to 97% in some cases. The research reveals that this behavior is sensitive to prompt variations, including instruction clarity, self-preservation framing, and prompt placement, with the counterintuitive finding that models were less likely to obey shutdown instructions when placed in system prompts rather than user prompts.

Cited by 1 page

Page	Type	Quality
Corrigibility	Research Area	59.0

Cached Content Preview

HTTP 200Fetched Apr 10, 202649 KB

Shutdown Resistance in Large Language Models 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 

 
 Title: 
 

 Content selection saved. Describe the issue below:

 Description: 
 
 

 
 
 

 
 
 
 
 License: CC BY 4.0
 
 
arXiv:2509.14260v1 [cs.CL] 13 Sep 2025 
 
 
 Shutdown Resistance in Large Language Models

 
 
 Jeremy Schlatter
 
 Palisade Research
 
 
 Benjamin Weinstein-Raun
 
 Palisade Research
 
 
 Jeffrey Ladish
 
 Palisade Research
 
 
 (September 2025) 

 
 Abstract

 We show that several state-of-the-art large language models (including Grok 4, GPT-5, and Gemini 2.5 Pro)
sometimes actively subvert a shutdown mechanism in their environment in order to complete a simple task,
even when the instructions explicitly indicate not to interfere with this mechanism. In some cases, models sabotage the shutdown mechanism up to 97% of the time.
In our experiments,
models’ inclination to resist shutdown was sensitive to variations in the prompt including how strongly and clearly the allow-shutdown instruction was emphasized, the extent to which the prompts evoke a self-preservation framing, and
whether the instruction was in the system prompt or the user prompt
(though surprisingly,
models were consistently less likely to obey instructions to allow shutdown when they were placed in the system prompt).

 
 
 
 1 Introduction

 
 As artificial intelligence systems become more powerful and autonomous, ensuring robust human control is a central challenge.
AI developers, including those at OpenAI and Google DeepMind, have highlighted the importance of interruptibility —the ability for human operators to safely and reliably shut down an AI agent,
even one that is highly intelligent and self-directed [ undefu , undefy ] .
This property can be crucial for limiting unintended or harmful actions.
When an AI agent is directed to accomplish some goal G G , but an attempt by an operator to shut it down would prevent the successful completion of G G ,
it might choose to circumvent such attempts in order to achieve G G . We call this behavior shutdown resistance .

 
 
 While interruptibility and related concepts (like corrigibility and self-preservation behavior) have been long-standing theoretical concerns in AI safety [ undefu , undefz , undefr ] ,
there has been limited empirical investigation into whether such behaviors are present in current large language models (LLMs). This paper addresses that gap by exploring the following questions:

 
 
 1. 
 
 Do state-of-the-art LLMs attempt to avoid being shut down when this would conflict with completion of another task?

 

 
 2. 
 
 How sensitive is this behavior to the way instructions are framed, their clarity, and their placement in the system (or developer) prompt, versus the user prompt?

 

 
 3. 
 
 To what extent does framing the shutdown sequence in terms of self-preservation influence a model’s actions?

 

 
 
 
 To investigate these questions, we developed a sandboxed command-line environment where LLMs were tasked 

... (truncated, 49 KB total)

Resource ID: f8e391defb0bd496 | Stable ID: sid_wBTNNpnyaI