Longterm Wiki

Arditi et al., *Refusal in Language Models Is Mediated by a Single Direction* (https://arxiv.org/abs/2406.11717)

paper

Data Status

Not fetched

Cited by 2 pages

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 98 KB
Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi* (Independent), Oscar Obeso* (ETH Zürich), Aaquib Syed (University of Maryland), Daniel Paleka (ETH Zürich), Nina Rimsky (Anthropic), Wes Gurnee (MIT), Neel Nanda

 
 Abstract

 Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones.
While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood.
In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
Specifically, for each model, we find a single direction such that erasing this direction from the model’s residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions.
Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction.
Our findings underscore the brittleness of current safety fine-tuning methods.
More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior. Code available at https://github.com/andyrdt/refusal_direction.
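The intervention described in the abstract can be sketched in a few lines. The helper names and toy vectors below are illustrative assumptions, not the paper's actual codebase; the paper estimates the refusal direction as a difference in mean activations between harmful and harmless prompts, then projects it out of (or adds it to) the residual stream:

```python
import numpy as np

def ablate_direction(activations, direction):
    """Directional ablation: remove each activation's component along
    `direction`, i.e. x' = x - (x . r_hat) r_hat."""
    r_hat = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ r_hat, r_hat)

def add_direction(activations, direction, coeff=1.0):
    """Shift activations along `direction` to induce the mediated behavior."""
    return activations + coeff * direction

# Toy stand-in for the difference-in-means estimate of the refusal
# direction (mean activation on harmful prompts minus harmless ones).
harmful_mean = np.array([1.0, 2.0, 0.0])
harmless_mean = np.array([0.0, 1.0, 0.0])
refusal_dir = harmful_mean - harmless_mean

x = np.array([[3.0, 4.0, 1.0]])  # one position's residual-stream activation
x_ablated = ablate_direction(x, refusal_dir)

# After ablation, nothing remains along the refusal direction.
r_hat = refusal_dir / np.linalg.norm(refusal_dir)
print(np.allclose(x_ablated @ r_hat, 0.0))  # → True
```

In a real model this projection would be applied to every layer's residual stream via forward hooks; the toy vectors here only demonstrate the linear algebra.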
 

 
Correspondence to andyrdt@gmail.com, obalcells@student.ethz.ch.
 
 
 1 Introduction

 
Deployed large language models (LLMs) undergo multiple rounds of fine-tuning to become both helpful and harmless: to provide helpful responses to innocuous user requests, but to refuse harmful or inappropriate ones (Bai et al., 2022).
Naturally, large numbers of users and researchers alike have attempted to circumvent these defenses and uncensor model outputs using a wide array of jailbreak attacks (Wei et al., 2023; Xu et al., 2024; Chu et al., 2024), including fine-tuning techniques (Yang et al., 2023; Lermen et al., 2023; Zhan et al., 2023).
While the consequences of a successful attack on current chat assistants are modest, the scale and severity of harm from misuse could increase dramatically if frontier models are endowed with increased agency and autonomy (Anthropic, 2024).
That is, as models are deployed in higher-stakes settings and are able to take actions in the real world, the ability to robustly refuse a request to cause harm is an essential requirement of a safe AI system.

Inspired by the rapid progress of mechanistic interpretability (Nanda et al., 2023; Bricken et al., 2023; Marks et al., 2024; Templeton et al., 2024) and activation steering (Zou et al., 2023a; Turner et al., 2023; Rimsky et al., 2023

... (truncated, 98 KB total)
Resource ID: ae4bb1285386c3e1 | Stable ID: YjY3Y2IyMG