The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.
💡 Research Summary
The paper investigates an often‑overlooked side effect of truthfulness‑enhancing interventions in large language models (LLMs): they can weaken the model’s refusal behavior and thus degrade safety alignment. The authors first demonstrate empirically that two representative truthfulness‑boosting methods—ITI (truthful head steering) and TruthfulX (latent‑direction steering)—significantly improve factual accuracy on the TruthfulQA benchmark but simultaneously increase attack success rates (ASR) on harmful safety benchmarks such as AdvBench and StrongReject. This trade‑off suggests that the internal representations responsible for factual generation overlap with those governing refusal.
To uncover the mechanism, the authors conduct a detailed mechanistic analysis. They identify specific attention heads (primarily in middle layers) whose activations encode both hallucination (false‑information) signals and refusal signals. When a low‑rank LoRA module is trained to steer the model away from a learned “hallucination direction,” the same heads shift their activation patterns, effectively moving the model toward higher factuality but also moving it away from the refusal subspace. Consequently, the model becomes more susceptible to jailbreak attempts.
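The geometric intuition here can be illustrated with a toy numpy sketch. This is not the paper's code; the dimensions, the steering strength, and the 0.6 cosine overlap between the two directions are all hypothetical values chosen for illustration. The point is only that when the "hallucination" and "refusal" directions share a component, steering an activation away from one necessarily shifts its projection onto the other:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy head-activation dimension (illustrative only)

# Hypothetical unit "hallucination" direction.
h_dir = rng.normal(size=d)
h_dir /= np.linalg.norm(h_dir)

# Build a unit "refusal" direction that overlaps with h_dir
# (cosine similarity 0.6), mimicking the shared components
# the paper identifies in middle-layer attention heads.
noise = rng.normal(size=d)
orth = noise - (noise @ h_dir) * h_dir
orth /= np.linalg.norm(orth)
r_dir = 0.6 * h_dir + 0.8 * orth  # unit vector, cos(h, r) = 0.6

x = rng.normal(size=d)  # a toy activation
alpha = 2.0             # steering strength away from the hallucination direction
x_steered = x - alpha * h_dir

before = x @ r_dir
after = x_steered @ r_dir
# Because the directions overlap, the refusal projection drops by alpha * 0.6:
assert np.isclose(before - after, alpha * 0.6)
```

The larger the overlap between the two directions, the more a truthfulness edit bleeds into the refusal subspace, which is exactly the mechanism behind the increased jailbreak susceptibility described above.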
The proposed solution leverages sparse autoencoders (SAEs) trained on head‑level activations to isolate two orthogonal subspaces: a hallucination subspace and a refusal subspace. During fine‑tuning, the loss function is augmented with a regularization term that penalizes any update that projects onto the refusal subspace, thereby preserving the model’s refusal capabilities while still allowing the truthfulness‑enhancing edits to operate within the hallucination subspace.
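The orthogonalization step can be sketched as follows. This is a minimal stand-in, not the authors' implementation: the refusal subspace is represented here by a random orthonormal basis rather than one recovered from a trained sparse autoencoder, and the "update" is a plain vector rather than a LoRA parameter update. It shows only the core projection that keeps fine-tuning updates out of the refusal subspace:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 64, 4  # toy dims: activation size and refusal-subspace rank (illustrative)

# Stand-in for a refusal subspace recovered by a sparse autoencoder:
# k orthonormal basis vectors stored as the columns of R.
R, _ = np.linalg.qr(rng.normal(size=(d, k)))

def orthogonalize(update, basis):
    """Remove the component of `update` that lies in span(basis)."""
    return update - basis @ (basis.T @ update)

u = rng.normal(size=d)           # a candidate fine-tuning update direction
u_safe = orthogonalize(u, R)

# The projected update no longer touches the refusal subspace,
# so refusal behavior encoded there is left intact.
assert np.allclose(R.T @ u_safe, 0.0)
```

In the paper's formulation this constraint appears as a regularization term in the loss rather than a hard projection, but both serve the same goal: truthfulness edits are confined to the hallucination subspace while the refusal subspace is preserved.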
Experimental evaluation on Llama‑3‑8B‑Instruct shows that this SAE‑guided fine‑tuning maintains or slightly improves TruthfulQA accuracy (comparable to the baseline truthfulness methods) while reducing ASR on AdvBench and StrongReject by 30‑45 percentage points. Importantly, performance on standard commonsense reasoning benchmarks (MMLU, ARC, HellaSwag) remains unchanged or shows modest gains, indicating that the orthogonal regularization does not sacrifice overall utility. The authors also verify that the extracted subspaces are highly sparse (under 0.5 % of total model dimensions), keeping computational overhead minimal.
The paper discusses limitations, including the focus on an 8‑billion‑parameter model and the need to validate the approach on larger models and diverse architectures. It also notes that the refusal subspace may shift with different data distributions, suggesting future work on dynamic subspace adaptation and human‑in‑the‑loop refinement of refusal representations.
In conclusion, the study provides strong evidence that truthfulness and safety are not inherently antagonistic; rather, their apparent conflict arises from shared internal representations. By disentangling these representations with sparse autoencoders and enforcing orthogonal constraints during fine‑tuning, it is possible to improve factuality while preserving, or even strengthening, safety alignment. This method offers a practical pathway for deploying more reliable and trustworthy LLMs in real‑world applications.