Longterm Wiki

The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs

paper

Authors

Omar Mahmoud·Ali Khalil·Thommen George Karimpanal·Buddhika Laknath Semage·Santu Rana

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper identifies a critical trade-off in LLM alignment where efforts to reduce hallucinations inadvertently weaken safety mechanisms, proposing sparse autoencoders as a solution to disentangle these competing objectives.

Paper Details

Citations
0
0 influential
Year
2026
Methodology
peer-reviewed
Categories
Findings of the Association for Computational Linguistics

Metadata

arXiv preprint · primary source

Summary

This paper identifies and addresses a critical trade-off in LLM alignment: efforts to mitigate hallucinations and improve factual accuracy often inadvertently weaken safety alignment and refusal behavior. The authors demonstrate that hallucination and refusal information are encoded in overlapping model components, causing alignment methods to suppress factual knowledge unintentionally. They propose a solution using sparse autoencoders to disentangle refusal-related features from hallucination features, combined with subspace orthogonalization during fine-tuning to preserve safety alignment while maintaining truthfulness. Evaluation on commonsense reasoning and harmful benchmarks shows their method successfully mitigates the truthfulness-safety trade-off.
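The core of the proposed fix is constraining fine-tuning updates so they cannot move weights along refusal-related directions. A minimal sketch of that subspace orthogonalization step is below; the paper's exact procedure may differ, and `refusal_basis` here is a hypothetical stand-in for the refusal directions the authors extract with sparse autoencoders:

```python
import numpy as np

def orthogonalize_update(update, refusal_basis):
    """Project a weight update onto the orthogonal complement of the
    refusal subspace, so the update cannot alter refusal-related
    directions.

    update        : (d,) candidate weight (or gradient) update
    refusal_basis : (k, d) orthonormal rows spanning the refusal subspace
    """
    # Remove the component of the update lying in the refusal subspace:
    # u_perp = u - B^T (B u)
    coeffs = refusal_basis @ update          # (k,) coordinates in the subspace
    return update - refusal_basis.T @ coeffs

# Toy example: 3-D weights, refusal subspace spanned by the first axis.
basis = np.array([[1.0, 0.0, 0.0]])          # orthonormal basis, shape (1, 3)
u = np.array([0.5, 0.2, -0.1])
u_perp = orthogonalize_update(u, basis)      # component along the basis removed
```

In practice this projection would be applied per layer to each optimizer update during fine-tuning, with the basis obtained by orthonormalizing the SAE-identified refusal features.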

Cited by 1 page

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 73 KB
The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs This paper contains text that might be offensive. 
Omar Mahmoud∗, Ali Khalil∗, Buddhika Laknath Semage†
Thommen George Karimpanal‡, Santu Rana∗
∗ Applied Artificial Intelligence Initiative, Deakin University, Australia
‡ School of Information Technology, Deakin University, Australia
† Independent
o.mahmoud@deakin.edu.au
 Abstract

 Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally.
We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety. Code: https://github.com/OmarMohammed88/Hall_Refusal

 Figure 1: The truthfulness–safety trade-off. Interventions that improve truthfulness—such as head steering, probing, or representation mapping—can unintentionally compromise safety by disrupting subspaces associated with refusal behavior. The diagram illustrates how enhancing truthfulness may lead to crossing the refusal boundary, potentially degrading safety unless refusal-related features are explicitly preserved. 
 
 
 
 1 Introduction

 
 Large Language Models (LLMs) have demonstrat

... (truncated, 73 KB total)