From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question-answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas like computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided appropriate modifications are made to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample-based detectors, achieving strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.


💡 Research Summary

The paper “From Out‑of‑Distribution Detection to Hallucination Detection: A Geometric View” reframes hallucination detection in large language models (LLMs) as an out‑of‑distribution (OOD) detection problem. By treating next‑token prediction as a high‑dimensional linear classification task, the authors adapt two lightweight OOD detectors—Neural‑Collapse‑Inspired (NCI) and fast Decision‑Boundary‑based Detector (fDBD)—to the hallucination detection setting.

Key technical ideas

  1. Geometric interpretation of token generation – At each decoding step the LLM produces a penultimate‑layer embedding z that is linearly projected by the language head into logits for every vocabulary token. The most confident token ĉ is obtained by arg‑max over these logits. This view allows the use of classic OOD uncertainty measures that rely on the relationship between embeddings, class weight vectors, and decision boundaries.

  2. Feature‑Proximity (NCI) – The NCI score is the cosine similarity between the penultimate embedding, centered by a mean feature vector μ_G, and the weight vector of the predicted token; lower similarity indicates higher uncertainty. Traditional OOD methods estimate μ_G from training data, which is infeasible for LLMs. The authors instead propose an analytical proxy, the "Decision‑Neutral Closest Point" (DNCP): the point that minimizes the variance of the logits across the entire vocabulary, which can be computed directly from the language‑head weight matrix W and bias b as

     $$\mu_G \;=\; \arg\min_{z}\ \operatorname{Var}_{c}\!\left(w_c^\top z + b_c\right) \;=\; -\,\tilde{W}^{+}\,\tilde{b},$$

     where $\tilde{W}$ and $\tilde{b}$ denote $W$ and $b$ with their vocabulary‑wise means subtracted (so that minimizing the logit variance becomes a least‑squares problem) and $(\cdot)^{+}$ is the Moore–Penrose pseudoinverse.
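The ideas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the toy language head, the pseudoinverse-based DNCP solve, and all variable names are assumptions; it shows the logit computation, the arg-max token, and the centered cosine-similarity score described in the list.

```python
import numpy as np

def dncp(W, b):
    # Decision-Neutral Closest Point: the embedding minimizing the variance
    # of the logits l = W z + b across the vocabulary. Centering the rows of
    # W and the entries of b reduces this to least squares:
    #   min_z ||(W - w_bar) z + (b - b_bar)||^2  =>  z* = -pinv(W_c) @ b_c
    W_c = W - W.mean(axis=0, keepdims=True)
    b_c = b - b.mean()
    return -np.linalg.pinv(W_c) @ b_c

def nci_score(z, W, b):
    # Cosine similarity between the DNCP-centered embedding and the weight
    # vector of the arg-max (most confident) token; lower => more uncertain.
    logits = W @ z + b                 # linear projection by the language head
    c_hat = int(np.argmax(logits))     # predicted token index
    v = z - dncp(W, b)                 # centered embedding
    w = W[c_hat]                       # weight vector of the predicted token
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

# Toy example: vocab of 1000 tokens, hidden size 64 (sizes are illustrative).
rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 64))
b = rng.standard_normal(1000)
z = rng.standard_normal(64)            # stands in for a penultimate embedding
print(nci_score(z, W, b))
```

Because the score is a cosine similarity, it lies in [-1, 1] and can be thresholded per generated token, which is what makes the detector training-free and single-sample.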

