Longterm Wiki

Arditi et al., *Refusal in Language Models Is Mediated by a Single Direction* (https://arxiv.org/abs/2406.11717)

paper

Data Status

Not fetched

Cited by 2 pages

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 98 KB
Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi* (Independent), Oscar Obeso* (ETH Zürich), Aaquib Syed (University of Maryland), Daniel Paleka (ETH Zürich), Nina Rimsky (Anthropic), Wes Gurnee (MIT), Neel Nanda

 
 Abstract

 Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones.
While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood.
In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
Specifically, for each model, we find a single direction such that erasing this direction from the model’s residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions.
Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction.
Our findings underscore the brittleness of current safety fine-tuning methods.
More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior. Code available at https://github.com/andyrdt/refusal_direction.
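The intervention described in the abstract can be sketched in a few lines. The helper names and toy vectors below are illustrative assumptions, not the paper's actual codebase; the paper estimates the refusal direction as a difference in mean activations between harmful and harmless prompts, then projects it out of (or adds it to) the residual stream:

```python
import numpy as np

def ablate_direction(activations, direction):
    """Directional ablation: remove each activation's component along
    `direction`, i.e. x' = x - (x . r_hat) r_hat."""
    r_hat = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ r_hat, r_hat)

def add_direction(activations, direction, coeff=1.0):
    """Shift activations along `direction` to induce the mediated behavior."""
    return activations + coeff * direction

# Toy stand-in for the difference-in-means estimate of the refusal
# direction (mean activation on harmful prompts minus harmless ones).
harmful_mean = np.array([1.0, 2.0, 0.0])
harmless_mean = np.array([0.0, 1.0, 0.0])
refusal_dir = harmful_mean - harmless_mean

x = np.array([[3.0, 4.0, 1.0]])  # one position's residual-stream activation
x_ablated = ablate_direction(x, refusal_dir)

# After ablation, nothing remains along the refusal direction.
r_hat = refusal_dir / np.linalg.norm(refusal_dir)
print(np.allclose(x_ablated @ r_hat, 0.0))  # → True
```

In a real model this projection would be applied to every layer's residual stream via forward hooks; the toy vectors here only demonstrate the linear algebra.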
 

 
Correspondence to andyrdt@gmail.com, obalcells@student.ethz.ch.
 
 
 1 Introduction

 
Deployed large language models (LLMs) undergo multiple rounds of fine-tuning to become both helpful and harmless: to provide helpful responses to innocuous user requests, but to refuse harmful or inappropriate ones (Bai et al., 2022).
Naturally, large numbers of users and researchers alike have attempted to circumvent these defenses and uncensor model outputs using a wide array of jailbreak attacks (Wei et al., 2023; Xu et al., 2024; Chu et al., 2024), including fine-tuning techniques (Yang et al., 2023; Lermen et al., 2023; Zhan et al., 2023).
While the consequences of a successful attack on current chat assistants are modest, the scale and severity of harm from misuse could increase dramatically if frontier models are endowed with increased agency and autonomy (Anthropic, 2024).
That is, as models are deployed in higher-stakes settings and are able to take actions in the real world, the ability to robustly refuse a request to cause harm is an essential requirement of a safe AI system.

Inspired by the rapid progress of mechanistic interpretability (Nanda et al., 2023; Bricken et al., 2023; Marks et al., 2024; Templeton et al., 2024) and activation steering (Zou et al., 2023a; Turner et al., 2023; Rimsky et al., 2023

... (truncated, 98 KB total)
Resource ID: ae4bb1285386c3e1 | Stable ID: YjY3Y2IyMG