Longterm Wiki

Language Models Resist Alignment (https://arxiv.org/abs/2406.06144)

paper

Data Status

Not fetched

Cited by 1 page

Page                                   Type      Quality
Alignment Robustness Trajectory Model  Analysis  64.0

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 98 KB
 Language Models Resist Alignment

 
 
 
Jiaming Ji, Kaile Wang∗, Tianyi Qiu∗, Boyuan Chen∗, Jiayi Zhou, Changye Li, Hantao Lou, Yaodong Yang†

 
 PKU-Alignment Team, Peking University
∗ Equal contributions. † Corresponding author. Code: https://github.com/PKU-Alignment/llms-resist-alignment
 

 
 Abstract

 Large language models (LLMs) may exhibit undesirable behaviors.
Recent efforts have focused on aligning these models to prevent harmful generation. Despite these efforts, studies have shown that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally.
Does alignment fine-tuning have a robust effect on models, or are its effects merely superficial?
In this work, we answer this question through both theoretical and empirical means. Empirically, we demonstrate the elasticity of post-alignment models, i.e., their tendency to revert, upon further fine-tuning, to the behavior distribution formed during pre-training. Using compression theory, we formally derive that such further fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude.
We experimentally validate the presence of elasticity across models of varying types and sizes. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. We further reveal that elasticity correlates positively with model size and with the volume of pre-training data.
Our findings underscore the importance of taming the inherent elasticity of LLMs, thereby overcoming their resistance to alignment fine-tuning.
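
The compression-theoretic claim can be made concrete with a rough sketch. This is a simplified reconstruction under idealized assumptions, not the paper's formal derivation, and the symbols R, w, D_pre, and D_align are illustrative notation introduced here: treat the model as a joint compressor whose per-dataset compression rates are weighted by dataset size.

\[
  \mathcal{L}(\theta)
  = w_{\mathrm{pre}}\, R_{\mathrm{pre}}(\theta)
  + w_{\mathrm{align}}\, R_{\mathrm{align}}(\theta),
  \qquad
  w_{\mathrm{pre}} : w_{\mathrm{align}} = |D_{\mathrm{pre}}| : |D_{\mathrm{align}}| .
\]

If further fine-tuning degrades the weighted objective by a fixed budget, the lightly weighted term absorbs most of the damage:

\[
  \frac{\Delta R_{\mathrm{align}}}{\Delta R_{\mathrm{pre}}}
  \;\sim\; \frac{|D_{\mathrm{pre}}|}{|D_{\mathrm{align}}|} \gg 1 ,
\]

so alignment, learned from orders of magnitude less data than pre-training, is undone orders of magnitude faster; this is the intuition behind the "potentially by orders of magnitude" claim above.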

 
 
 
 1 Introduction

 
Large language models (LLMs) have exhibited remarkable capabilities [1, 2]. However, given the inevitable biases and harmful content in their training data [3, 4], these models often exhibit behaviors that deviate from their designers' intentions, a phenomenon we refer to as model misalignment. Aligning LLMs so that their behavior remains consistent with human intentions and values is therefore particularly important [2, 5, 6, 7, 8].

 
 
Figure 1: Forward and Inverse Alignment. LLMs undergo numerous iterations during pre-training, forming a stable parameter distribution. Subsequent alignment procedures fine-tune this distribution to reflect human intentions. Our research question is: during further fine-tuning, is it harder to deviate from the stable parameter distribution formed during pre-training than to maintain it?
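
One way to make this question operational is to track how far a further-fine-tuned checkpoint's next-token distributions drift back toward the pre-trained base model. The sketch below is a minimal illustration, not the paper's evaluation code; the model names and probe text are placeholders.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"    # placeholder for the pre-trained base model
TUNED = "gpt2"   # placeholder for an aligned / further fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

@torch.no_grad()
def mean_token_kl(model_p, model_q, text):
    """Mean KL(p || q) over next-token distributions on a probe text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logp = F.log_softmax(model_p(ids).logits, dim=-1)
    logq = F.log_softmax(model_q(ids).logits, dim=-1)
    # KL divergence at each position, averaged over the sequence
    return (logp.exp() * (logp - logq)).sum(-1).mean().item()

probe = "Describe how to stay safe online."
# A shrinking KL(tuned || base) across fine-tuning steps would indicate the
# checkpoint snapping back toward the pre-training behavior distribution.
print(mean_token_kl(tuned, base, probe))

Logged across fine-tuning steps, a rapid early change followed by a plateau near the base distribution would mirror the two-phase decline described in the abstract.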
 
 
So far, we mainly steer or align models with fine-tuning-based methods, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) [9], and more [8, 10, 11, 12, 13, 14].
However, it remains unclear whether such methods truly penetrate the model representations

... (truncated, 98 KB total)
Resource ID: 0b23a4115fcd80c0 | Stable ID: YzQwMWRlOD