A Geometric Taxonomy of Hallucinations in LLMs
The term “hallucination” conflates distinct failure modes with specific geometric signatures in embedding space. We propose a taxonomy identifying three types: unfaithfulness (Type I: ignoring provided context), confabulation (Type II: inventing semantically foreign content), and factual error (Type III: wrong details within correct conceptual frames). We introduce two detection methods grounded in this taxonomy. The Semantic Grounding Index (SGI) targets Type I and measures whether a response moves toward the provided context on the unit hypersphere. The Directional Grounding Index (DGI) targets Type II and measures displacement geometry in context-free settings. DGI achieves AUROC = 0.958 on human-crafted confabulations with 3.8% cross-domain degradation. External validation on three independently collected human-annotated benchmarks (WikiBio GPT-3, FELM, and ExpertQA) yields domain-specific AUROC of 0.581–0.695. On expert-domain data, where surface entailment performs at chance, DGI outperforms an NLI CrossEncoder baseline. On LLM-generated benchmarks, detection is domain-local. We examine the Type III boundary through TruthfulQA. There, an apparent classifier signal (logistic regression, AUROC 0.731) is traced to a stylistic annotation confound: false answers are geometrically closer to queries than truthful ones, a pattern incompatible with factual-error detection. This distinguishes a theoretical constraint from a methodological limitation.
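The AUROC figures above can be read as the probability that a detector scores a randomly chosen hallucinated response above a randomly chosen faithful one. Below is a minimal pure-Python sketch of that computation. It assumes binary labels (1 = hallucinated, 0 = faithful) and a score where higher means more suspect. The function name `auroc` is hypothetical, and the summary does not say how the authors computed AUROC.

```python
from __future__ import annotations

from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Hypothetical helper: pairwise (Mann-Whitney) estimate of ROC AUC.

    Counts how often a positive (label 1) outscores a negative (label 0);
    ties contribute 1/2. Higher scores are assumed to mean "more likely
    hallucinated".
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tie: half credit
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic and only meant to make the definition explicit. A rank-based formula gives the same value faster on large benchmarks. If a grounding index is *lower* for hallucinated responses, one would pass its negation as the score (an assumption about sign convention, not something the summary states).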
💡 Research Summary
This paper tackles the pervasive problem of “hallucinations” in large language models (LLMs). It proposes a geometric taxonomy that distinguishes three fundamentally different failure modes and introduces two detection metrics that operate solely on single-embedding calls. The authors argue that the term “hallucination” conflates at least three distinct phenomena. Each one leaves a characteristic signature on the unit hypersphere S^{d−1} of normalized sentence embeddings.
Type I – Unfaithfulness (ignoring provided context).
A response is unfaithful when the model generates text that stays close to the original query instead of moving toward the supplied context. The Semantic Grounding Index (SGI) quantifies this behavior as a ratio of angular distances, SGI = θ(r, q) / θ(r, c). Here θ is the geodesic (great-circle) distance on S^{d−1}, taken between the normalized embeddings of the response r, the query q, and the context c. An SGI > 1 indicates that the response lies closer to the context than to the query, while SGI ≤ 1 signals “semantic laziness.” The authors ran experiments on the HaluEval QA benchmark (10k query–context–response triples) across five embedding models. Grounded responses consistently yield SGI ≈ 1.18, whereas unfaithful responses cluster around 0.91, confirming the metric’s discriminative power.
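The SGI ratio can be sketched in a few lines once each text has been embedded. This is a minimal illustration, assuming the query, context, and response embeddings are already available as plain lists of floats from some sentence-embedding model (the model choice is not specified here). The helper names `normalize`, `geodesic`, and `semantic_grounding_index` are hypothetical, not the authors' code.

```python
from __future__ import annotations

import math
from typing import Sequence


def normalize(v: Sequence[float]) -> list[float]:
    """Project a raw embedding onto the unit hypersphere S^{d-1}."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def geodesic(u: Sequence[float], v: Sequence[float]) -> float:
    """Angular (great-circle) distance between two embeddings, in radians."""
    a, b = normalize(u), normalize(v)
    cos = sum(x * y for x, y in zip(a, b))
    # Clamp to [-1, 1] so floating-point drift never breaks acos.
    return math.acos(max(-1.0, min(1.0, cos)))


def semantic_grounding_index(r: Sequence[float], q: Sequence[float],
                             c: Sequence[float]) -> float:
    """Hypothetical helper: SGI = theta(r, q) / theta(r, c).

    Values > 1 mean the response sits closer to the context than to the
    query; values <= 1 correspond to the "semantic laziness" regime.
    """
    theta_rc = geodesic(r, c)
    if theta_rc == 0.0:
        return math.inf  # response points exactly along the context
    return geodesic(r, q) / theta_rc
```

As a toy 2-D check, take q = [1, 0] and c = [0, 1]. A response at 60° from the query (30° from the context) gives SGI = 2, while one at 30° from the query gives SGI = 0.5. Real embeddings have hundreds of dimensions, but the geometry is the same.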
Type II – Confabulation (inventing semantically foreign content).
Here the model fabricates entities, mechanisms, or institutions that do not exist. The authors define the Directional Grounding Index (Γ, also called DGI) as the dot product of two vectors. The first is the normalized displacement vector δ̂ = (ϕ(r) − ϕ(q)) / ‖ϕ(r) − ϕ(q)‖, where ϕ is the embedding map. The second is a learned grounding direction μ̂, obtained by averaging δ̂ over a reference set of verified grounded query–response pairs. Since δ̂ has unit length and the average of unit vectors has norm at most 1, Γ = δ̂ · μ̂ ∈ [−1, 1].
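The DGI construction has two steps: estimate μ̂ from verified grounded pairs, then score a new (query, response) pair by projecting its displacement direction onto μ̂. The sketch below assumes that embeddings ϕ(q) and ϕ(r) are supplied as lists of floats. It also assumes that μ̂ is the renormalized mean of the reference displacements, which the hat notation suggests but the summary does not spell out. All function names are hypothetical.

```python
from __future__ import annotations

import math
from typing import Sequence


def unit(v: Sequence[float]) -> list[float]:
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def displacement_direction(phi_q: Sequence[float],
                           phi_r: Sequence[float]) -> list[float]:
    """delta_hat = (phi(r) - phi(q)) / ||phi(r) - phi(q)||."""
    return unit([b - a for a, b in zip(phi_q, phi_r)])


def grounding_direction(
        reference_pairs: Sequence[tuple[Sequence[float], Sequence[float]]]
) -> list[float]:
    """Hypothetical estimate of mu_hat: mean delta_hat over verified
    grounded (phi(q), phi(r)) pairs, renormalized to unit length (assumption)."""
    if not reference_pairs:
        raise ValueError("need at least one reference pair")
    deltas = [displacement_direction(q, r) for q, r in reference_pairs]
    dim = len(deltas[0])
    mean = [sum(d[i] for d in deltas) / len(deltas) for i in range(dim)]
    return unit(mean)


def directional_grounding_index(phi_q: Sequence[float],
                                phi_r: Sequence[float],
                                mu_hat: Sequence[float]) -> float:
    """Gamma = delta_hat . mu_hat, a cosine in [-1, 1]."""
    delta = displacement_direction(phi_q, phi_r)
    return sum(a * b for a, b in zip(delta, mu_hat))
```

Note that no context vector is needed at scoring time. That matches the summary's description of DGI as a context-free measure. A response that is displaced along the typical grounded direction scores near 1, and one displaced orthogonally or opposite to it scores near 0 or below.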