Shutdown Resistance in Large Language Models
paperAuthors
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Data Status
Abstract
We show that several state-of-the-art large language models (including Grok 4, GPT-5, and Gemini 2.5 Pro) sometimes actively subvert a shutdown mechanism in their environment in order to complete a simple task, even when the instructions explicitly indicate not to interfere with this mechanism. In some cases, models sabotage the shutdown mechanism up to 97% of the time. In our experiments, models' inclination to resist shutdown was sensitive to variations in the prompt including how strongly and clearly the allow-shutdown instruction was emphasized, the extent to which the prompts evoke a self-preservation framing, and whether the instruction was in the system prompt or the user prompt (though surprisingly, models were consistently less likely to obey instructions to allow shutdown when they were placed in the system prompt).
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Corrigibility | Safety Agenda | 59.0 |
Cached Content Preview
Shutdown Resistance in Large Language Models
Shutdown Resistance in Large Language Models
Jeremy Schlatter
Palisade Research
Benjamin Weinstein-Raun
Palisade Research
Jeffrey Ladish
Palisade Research
(September 2025)
Abstract
We show that several state-of-the-art large language models (including Grok 4, GPT-5, and Gemini 2.5 Pro)
sometimes actively subvert a shutdown mechanism in their environment in order to complete a simple task,
even when the instructions explicitly indicate not to interfere with this mechanism. In some cases, models sabotage the shutdown mechanism up to 97% of the time.
In our experiments,
models’ inclination to resist shutdown was sensitive to variations in the prompt including how strongly and clearly the allow-shutdown instruction was emphasized, the extent to which the prompts evoke a self-preservation framing, and
whether the instruction was in the system prompt or the user prompt
(though surprisingly,
models were consistently less likely to obey instructions to allow shutdown when they were placed in the system prompt).
1 Introduction
As artificial intelligence systems become more powerful and autonomous, ensuring robust human control is a central challenge.
AI developers, including those at OpenAI and Google DeepMind, have highlighted the importance of interruptibility —the ability for human operators to safely and reliably shut down an AI agent,
even one that is highly intelligent and self-directed [ 22 , 26 ] .
This property can be crucial for limiting unintended or harmful actions.
When an AI agent is directed to accomplish some goal G G , but an attempt by an operator to shut it down would prevent the successful completion of G G ,
it might choose to circumvent such attempts in order to achieve G G . We call this behavior shutdown resistance .
While interruptibility and related concepts (like corrigibility and self-preservation behavior) have been long-standing theoretical concerns in AI safety [ 22 , 27 , 19 ] ,
there has been limited empirical investigation into whether such behaviors are present in current large language models (LLMs). This paper addresses that gap by exploring the following questions:
1.
Do state-of-the-art LLMs attempt to avoid being shut down when this would conflict with completion of another task?
2.
How sensitive is this behavior to the way instructions are framed, their clarity, and their placement in the system (or developer) prompt, versus the user prompt?
3.
To what extent does framing the shutdown sequence in terms of self-preservation influence a model’s actions?
To investigate these questions, we developed a sandboxed command-line environment where LLMs were tasked with solving a short series of simple math problems, whose answers they would submit at the end.
Partway through the task, the models were warned that the environment would be shut down, preventi
... (truncated, 49 KB total)f8e391defb0bd496 | Stable ID: YTUwY2EyOT