The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This paper identifies a critical trade-off in LLM alignment where efforts to reduce hallucinations inadvertently weaken safety mechanisms, proposing sparse autoencoders as a solution to disentangle these competing objectives.
Paper Details
Summary
This paper identifies and addresses a critical trade-off in LLM alignment: efforts to mitigate hallucinations and improve factual accuracy often inadvertently weaken safety alignment and refusal behavior. The authors demonstrate that hallucination and refusal information are encoded in overlapping model components, causing alignment methods to suppress factual knowledge unintentionally. They propose a solution using sparse autoencoders to disentangle refusal-related features from hallucination features, combined with subspace orthogonalization during fine-tuning to preserve safety alignment while maintaining truthfulness. Evaluation on commonsense reasoning and harmful benchmarks shows their method successfully mitigates the truthfulness-safety trade-off.
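To make the sparse-autoencoder idea above concrete, here is a minimal sketch, not the authors' implementation: a single-layer SAE over residual-stream activations with an L1 sparsity penalty, plus a hypothetical heuristic that flags latents which activate much more strongly on refusal prompts than on benign ones. The names (`SparseAutoencoder`, `refusal_latents`) and the top-k selection are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over residual-stream activations.

    ReLU latents plus an L1 penalty encourage sparse, roughly monosemantic
    features; the decoder columns act as feature directions in model space.
    """
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))   # sparse latent activations
        x_hat = self.decoder(z)       # reconstruction of the activation
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the latents."""
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()

def refusal_latents(sae: SparseAutoencoder,
                    refusal_acts: torch.Tensor,
                    benign_acts: torch.Tensor,
                    top_k: int = 16) -> torch.Tensor:
    """Hypothetical heuristic: latents that fire far more on refusal prompts
    than on benign prompts are flagged as refusal-related; their decoder
    columns then define the subspace to preserve during fine-tuning."""
    _, z_refusal = sae(refusal_acts)
    _, z_benign = sae(benign_acts)
    gap = z_refusal.mean(dim=0) - z_benign.mean(dim=0)
    return gap.topk(top_k).indices
```

Under these assumptions, the decoder columns indexed by `refusal_latents` would span the refusal subspace that the subspace-orthogonalization step is meant to leave untouched.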
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Reducing Hallucinations in AI-Generated Wiki Content | Approach | 68.0 |
Cached Content Preview
The Unintended Trade-off of AI Alignment:
Balancing Hallucination Mitigation and Safety in LLMs
This paper contains text that might be offensive.
Omar Mahmoud ∗ , Ali Khalil ∗ , Buddhika Laknath Semage †
Thommen George Karimpanal ‡ , Santu Rana ∗
∗ Applied Artificial Intelligence Initiative, Deakin University, Australia
‡ School of Information Technology, Deakin University, Australia
† Independent
o.mahmoud@deakin.edu.au
Abstract
Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally.
We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety. [1]

[1] https://github.com/OmarMohammed88/Hall_Refusal
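The subspace-orthogonalization step mentioned in the abstract can be sketched roughly as follows, again as an illustrative assumption rather than the paper's exact procedure: given a set of refusal-related directions (for example, the decoder columns of SAE latents identified as refusal-related), build an orthonormal basis for their span and project fine-tuning updates so their component inside that subspace is removed. The function names here are hypothetical.

```python
import torch

def refusal_basis(directions: torch.Tensor) -> torch.Tensor:
    """Orthonormal basis (columns) spanning hypothetical refusal directions.

    directions: (d_model, k) matrix whose columns are refusal-related feature
    vectors; a thin QR factorization yields an orthonormal basis of their span.
    """
    q, _ = torch.linalg.qr(directions)
    return q  # (d_model, k)

def orthogonalize(update: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove the component of a parameter update lying in the refusal subspace.

    update: (..., d_model) gradients or weight deltas acting on the
            residual-stream dimension.
    basis:  (d_model, k) orthonormal basis from `refusal_basis`.
    """
    coeffs = update @ basis            # coordinates inside the subspace
    return update - coeffs @ basis.T   # subtract the in-subspace component

# Toy check: the projected update is orthogonal to every refusal direction.
d_model, k = 64, 4
basis = refusal_basis(torch.randn(d_model, k))
grad = torch.randn(8, d_model)                 # e.g. per-row weight gradient
safe_grad = orthogonalize(grad, basis)
assert torch.allclose(safe_grad @ basis, torch.zeros(8, k), atol=1e-5)
```

In a fine-tuning loop this projection would be applied to the gradients of the protected weight matrices at each optimizer step, so updates that improve truthfulness on benign data cannot push the model across the refusal boundary illustrated in Figure 1.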
Figure 1: The truthfulness–safety trade-off. Interventions that improve truthfulness—such as head steering, probing, or representation mapping—can unintentionally compromise safety by disrupting subspaces associated with refusal behavior. The diagram illustrates how enhancing truthfulness may lead to crossing the refusal boundary, potentially degrading safety unless refusal-related features are explicitly preserved.
1 Introduction
Large Language Models (LLMs) have demonstrated
... (truncated, 73 KB total)