The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods


Large Language Models (LLMs) frequently hallucinate plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are confidently wrong. We propose DiffuTruth, an unsupervised framework that reconceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the Generative Stress Test: claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric measuring the semantic divergence between the original claim and its reconstruction using an NLI critic. Unlike vector-space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration fusing this stability signal with discriminative confidence. Extensive experiments on FEVER demonstrate that DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by 1.5 percent through the correction of overconfident predictions. Furthermore, we show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4 percent, confirming the robustness of thermodynamic truth properties to distribution shifts.


💡 Research Summary

The paper introduces DiffuTruth, an unsupervised framework for detecting hallucinations in large language models by leveraging the dynamics of discrete text diffusion models. The authors start from a thermodynamic perspective: factual statements are hypothesized to lie in low‑energy attractor regions of the generative manifold, while false statements occupy high‑energy repeller regions. To operationalize this, they propose a “Generative Stress Test.” A claim is first embedded, then perturbed with Gaussian noise up to a focal timestep (approximately 50 % noise). The diffusion model (implemented with DiffuSeq) then performs reverse denoising to reconstruct the claim. True claims, being in‑distribution, are restored with minimal semantic change; false claims, being out‑of‑distribution, are actively “corrected” toward the nearest factual neighbor (e.g., changing an incorrect date to a historically accurate one).
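The stress test above can be sketched as a small procedure: noise a claim embedding up to the focal timestep, then hand it to a denoiser. This is a minimal illustration, not DiffuSeq's actual implementation; the linear noise schedule, the `denoise_fn` placeholder, and the toy embedding are all assumptions for demonstration.

```python
import math
import random

def add_noise(embedding, t, T=1000, seed=0):
    """Forward diffusion step: blend the embedding with Gaussian noise.
    alpha here follows a simple linear schedule, a stand-in for the
    schedule DiffuSeq actually uses (not specified in the summary)."""
    rng = random.Random(seed)
    alpha = 1.0 - t / T  # fraction of signal kept; t = T/2 gives ~50% noise
    return [math.sqrt(alpha) * x + math.sqrt(1.0 - alpha) * rng.gauss(0, 1)
            for x in embedding]

def stress_test(embedding, denoise_fn, focal_t=500, T=1000):
    """Generative Stress Test: corrupt the claim to the focal timestep
    (~50% noise), then reconstruct it. `denoise_fn(noisy, t)` is a
    placeholder for the diffusion model's reverse denoising process."""
    noisy = add_noise(embedding, focal_t, T)
    return denoise_fn(noisy, focal_t)

# Toy usage: an identity "denoiser" simply returns the noisy vector;
# a real model would pull the claim back toward the factual manifold.
recon = stress_test([0.1, 0.2, 0.3], lambda z, t: z)
```

For a true claim, the reconstruction should land close to the input; for a false claim, the model's denoising drifts toward the nearest in-distribution (factual) neighbor.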

The key insight is that raw reconstruction error (e.g., mean‑squared error in embedding space) is dominated by syntactic similarity and fails to capture the semantic drift that signals falsehood. Instead, the authors define a Semantic Energy metric: they feed the original claim as a premise and the reconstructed claim as a hypothesis into a pretrained Natural Language Inference (NLI) model and take the probability of the “contradiction” label as the energy value. High Semantic Energy indicates that the diffusion process has rejected the input, i.e., the input is likely a hallucination.
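The Semantic Energy computation reduces to a single NLI query with the original claim as premise and the reconstruction as hypothesis. The sketch below stubs out the NLI critic (the paper uses a pretrained model); `toy_critic` and its scores are illustrative assumptions only.

```python
def semantic_energy(original, reconstruction, nli_contradiction_prob):
    """Semantic Energy E_sem: the NLI critic's P(contradiction) with the
    original claim as premise and the reconstruction as hypothesis.
    `nli_contradiction_prob` is a placeholder for any pretrained NLI model."""
    return nli_contradiction_prob(premise=original, hypothesis=reconstruction)

def toy_critic(premise, hypothesis):
    """Stub critic (illustrative only): flags any textual change as a
    contradiction. A real NLI model scores semantic entailment instead."""
    return 0.9 if premise != hypothesis else 0.05

# True claim: reconstruction is unchanged -> low energy.
e_true = semantic_energy("Paris is in France.", "Paris is in France.", toy_critic)
# False claim: diffusion "corrected" the date -> high energy.
e_false = semantic_energy("The war ended in 1944.", "The war ended in 1945.", toy_critic)
```

Note the asymmetry this buys: a paraphrase with high embedding-space MSE but no contradiction yields low Semantic Energy, while a one-token factual correction (a changed date) yields high energy despite near-identical surface form.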

To combine this generative signal with the strong discriminative signal of standard NLI classifiers, they introduce a Hybrid Calibration score:
S_hybrid = λ · S_disc + (1 − λ)·(1 − E_sem),
where S_disc is the confidence of a DeBERTa‑v3 based NLI classifier and λ is tuned on a validation set (λ ≈ 0.5).
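The fusion formula is a direct convex combination, sketched below with the paper's λ ≈ 0.5; the example inputs are made up to show the calibration effect.

```python
def hybrid_score(s_disc, e_sem, lam=0.5):
    """Hybrid Calibration: S_hybrid = lam * S_disc + (1 - lam) * (1 - E_sem).
    s_disc: discriminative NLI classifier confidence that the claim is true;
    e_sem:  Semantic Energy (contradiction probability from the critic)."""
    return lam * s_disc + (1.0 - lam) * (1.0 - e_sem)

# An overconfident classifier (s_disc = 0.9) on a generatively unstable
# claim (e_sem = 0.8) is pulled down: 0.5*0.9 + 0.5*0.2 = 0.55.
s = hybrid_score(0.9, 0.8)
```

This is how overconfident predictions get corrected: the generative stability term vetoes high discriminative confidence whenever the diffusion model rejects the claim.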

Experiments are conducted on two benchmarks. FEVER serves as an in‑domain fact‑verification dataset; the diffusion model is fine‑tuned only on the “SUPPORTED” (true) subset, making the approach truly unsupervised with respect to false examples. HOVER is a multi‑hop, out‑of‑distribution dataset used for zero‑shot evaluation. The authors report Area Under the ROC Curve (AUROC) and accuracy for several baselines: random guessing, raw MSE‑based energy, and a direct NLI classifier.

Results on FEVER show that raw MSE yields AUROC 0.541, while Semantic Energy alone reaches 0.640. The Hybrid Calibration achieves the best performance with AUROC 0.725 and 66.1 % accuracy, surpassing the strong discriminative baseline (AUROC 0.710). On HOVER, the direct NLI classifier collapses to AUROC 0.525, whereas DiffuTruth maintains AUROC 0.566, demonstrating superior robustness to distribution shift. These findings support the manifold hypothesis: the low‑energy attractor structure learned by the diffusion model captures fundamental semantic properties that transfer across domains, unlike decision boundaries of supervised classifiers.
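AUROC here measures how well a score separates hallucinations from factual claims: it is the probability that a randomly chosen hallucination receives a higher score than a randomly chosen factual claim. A minimal rank-based implementation, on made-up scores (not the paper's data):

```python
def auroc(scores_pos, scores_neg):
    """AUROC via pairwise comparison: fraction of (hallucination, factual)
    pairs where the hallucination scores higher; ties count half.
    O(n*m) sketch, fine for illustration."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Illustrative Semantic Energy scores (assumed values, not from the paper):
halluc = [0.9, 0.7, 0.6]   # hallucinated claims
factual = [0.2, 0.4, 0.65] # factual claims
score = auroc(halluc, factual)  # 8 of 9 pairs correctly ordered
```

A score of 0.5 corresponds to random guessing, which is why the NLI classifier's collapse to 0.525 on HOVER indicates its decision boundary barely transfers.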

The paper’s contributions are threefold: (1) a novel stress‑test procedure that probes the stability of textual claims under diffusion dynamics; (2) the introduction of an NLI‑based Semantic Energy as a principled, meaning‑aware measure of reconstruction drift; (3) a hybrid calibration that fuses generative stability with discriminative confidence, achieving state‑of‑the‑art unsupervised fact verification and better OOD generalization.

Limitations include the computational overhead of diffusion sampling (approximately 143 ms per claim versus 76 ms for the discriminative baseline), and the requirement for a corpus of true statements to train the diffusion model, which may be scarce in specialized domains. Future work is suggested in the directions of more efficient sampling, multimodal diffusion models, and self‑supervised training on unlabeled text to broaden applicability.

