Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering
Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in sensitive fields such as medicine or law. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral-7B through FalseCite, we observed a noticeable increase in hallucination rates for false claims paired with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models, visualizing and clustering the hidden-state vectors. From this analysis, we noticed that the hidden-state vectors, for hallucinated and non-hallucinated responses alike, tend to trace out a distinct horn-like shape. Our work underscores FalseCite’s potential as a foundation for evaluating and mitigating hallucinations in future LLM research.


💡 Research Summary

The paper introduces FalseCite, a benchmark specifically designed to probe factual hallucinations in large language models (LLMs) when they are presented with misleading or fabricated citations. The dataset is built by merging two public corpora—FEVER, which supplies short false statements about general knowledge, and SciQ, which provides false scientific statements derived from incorrect answer choices. From these sources the authors generate 82,000 false claims and attach citations in two ways: (1) random pairing, where the citation is unrelated to the claim, and (2) semantic pairing, where a citation is selected based on embedding similarity to the claim. This dual strategy lets the authors test whether the mere presence of a citation, or a topically coherent but fabricated citation, drives hallucination.
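The paper does not publish its pairing code; the sketch below is a minimal NumPy illustration of the two strategies, using toy random vectors in place of real claim and citation embeddings (the embedding model itself is not specified here and is left as an assumption):

```python
import numpy as np

def pair_citations(claim_embs: np.ndarray, cite_embs: np.ndarray,
                   rng: np.random.Generator, semantic: bool) -> np.ndarray:
    """Return one citation index per claim.

    semantic=True  -> nearest citation by cosine similarity (semantic pairing)
    semantic=False -> uniformly random citation (random pairing)
    """
    if not semantic:
        return rng.integers(0, len(cite_embs), size=len(claim_embs))
    # Cosine similarity: L2-normalize rows, then take inner products.
    c = claim_embs / np.linalg.norm(claim_embs, axis=1, keepdims=True)
    s = cite_embs / np.linalg.norm(cite_embs, axis=1, keepdims=True)
    return np.argmax(c @ s.T, axis=1)

# Toy stand-ins for embeddings of 5 false claims and 8 candidate citations.
rng = np.random.default_rng(0)
claims = rng.normal(size=(5, 16))
cites = rng.normal(size=(8, 16))
sem_idx = pair_citations(claims, cites, rng, semantic=True)
rand_idx = pair_citations(claims, cites, rng, semantic=False)
```

In a real pipeline the toy vectors would be replaced by sentence embeddings of the claims and citation texts; the contrast between the two branches is what isolates "any citation" from "a topically coherent citation" as the driver of hallucination.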

Three models are evaluated: GPT‑4o‑mini (the largest), Falcon‑7B, and Mistral‑7B (both smaller). Because manual annotation was infeasible, the authors use GPT‑4.1 as an “expert” labeler. The expert model cannot verify the authenticity of citations (it has no internet access), so it classifies a response as non‑hallucinated if the citation looks plausible, and as hallucinated if the citation is obviously implausible. Despite this limitation, the expert model achieves 75.2% accuracy on the HALUEVAL benchmark, which the authors deem sufficient for comparative analysis.

Results show a consistent increase in hallucination rates across all models when false citations are added. Random citations produce the largest jump, while semantically matched citations still cause a notable rise, albeit smaller. GPT‑4o‑mini has the lowest baseline hallucination rate but exhibits the greatest absolute increase (≈40 percentage points) when citations are introduced, suggesting that larger models are more susceptible to being “tricked” by seemingly credible references. Smaller models (Falcon‑7B and Mistral‑7B) display higher baseline hallucination rates but a smaller differential between random and semantic citations, indicating they may rely more on surface cues than on deeper semantic alignment.

Beyond performance metrics, the authors conduct an internal‑state analysis to explore how hallucinations manifest inside the model. For each hallucinated response they identify the five most “influential” layers by computing Spearman correlations between token‑level hallucination labels (provided by the expert model) and three attention‑derived statistics (mean, max, entropy) for each layer. They then extract the aggregated attention vector for each token‑layer pair, concatenate it with the corresponding hidden‑state vector, and obtain a 4,544‑dimensional representation per layer. Non‑hallucinated responses are represented using all 32 layers for comparison.
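The layer-selection step can be sketched as follows; this is an assumed reading of the procedure, not the authors' code, and the tensor shapes (per-response attention maps of shape layers × tokens × tokens) are illustrative assumptions:

```python
import numpy as np
from scipy.stats import spearmanr

def top_influential_layers(attn: np.ndarray, labels: np.ndarray,
                           k: int = 5) -> np.ndarray:
    """Rank layers by how strongly attention statistics track hallucination.

    attn:   (n_layers, n_tokens, n_tokens) attention maps for one response
    labels: (n_tokens,) token-level hallucination labels from the expert model
    Returns the indices of the k highest-scoring layers.
    """
    scores = np.zeros(attn.shape[0])
    for layer, rows in enumerate(attn):
        # Three per-token statistics over each token's attention distribution.
        stats = (
            rows.mean(axis=1),                           # mean attention
            rows.max(axis=1),                            # max attention
            -(rows * np.log(rows + 1e-12)).sum(axis=1),  # attention entropy
        )
        rhos = []
        for s in stats:
            rho = spearmanr(s, labels).correlation
            rhos.append(0.0 if np.isnan(rho) else abs(rho))
        # Score each layer by its strongest correlate among the three stats.
        scores[layer] = max(rhos)
    return np.argsort(scores)[::-1][:k]
```

How the three statistics are combined into a single per-layer score is not spelled out in the summary; taking the maximum absolute correlation is one plausible choice among several (averaging would be another).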

After reducing dimensionality to 100 via Principal Component Analysis, the authors apply k‑means clustering. Cluster quality is assessed by the proportion of hallucinated responses within each cluster; the optimal number of clusters minimizes the average deviation from 0% or 100% hallucination rates. Visualizing the clusters reveals a distinctive “horn‑like” shape: vectors from hallucinated responses tend to trace the outer curve of the horn, while non‑hallucinated vectors occupy the inner region. This pattern suggests that during hallucination the model’s hidden states evolve along a specific trajectory across layers, potentially reflecting a shift toward a subspace associated with fabricated reasoning.
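A compact scikit-learn sketch of this pipeline, under stated assumptions: the toy two-blob data stands in for the real 4,544-dimensional token-layer vectors, and the purity score implements the "average deviation from 0% or 100%" criterion as described:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_purity_score(X: np.ndarray, is_hallu: np.ndarray,
                         k: int, seed: int = 0) -> float:
    """Average per-cluster deviation of the hallucination rate from 0% or 100%.

    Lower is better: 0.0 means every cluster contains only one class.
    """
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    devs = [min(p, 1.0 - p)
            for c in range(k)
            for p in (is_hallu[labels == c].mean(),)]
    return float(np.mean(devs))

rng = np.random.default_rng(0)
# Toy stand-in: two well-separated Gaussian blobs of representation vectors.
X = np.vstack([rng.normal(0, 1, (60, 200)), rng.normal(4, 1, (60, 200))])
is_hallu = np.array([0] * 60 + [1] * 60)

X100 = PCA(n_components=100, random_state=0).fit_transform(X)
# Pick the cluster count that minimizes the purity deviation.
best_k = min(range(2, 6), key=lambda k: cluster_purity_score(X100, is_hallu, k))
```

On real model states the separation is far less clean than in this toy example; the score simply quantifies how close each cluster comes to being all-hallucinated or all-clean.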

The paper acknowledges several limitations. Relying on GPT‑4.1 for labeling introduces bias, especially regarding citation verification. Human annotations or Retrieval‑Augmented Generation (RAG) models would provide more reliable ground truth. The internal‑state analysis focuses solely on attention‑derived statistics, omitting other signals such as feed‑forward activations, gradient flows, or token‑level logits, which could offer a fuller picture of hallucination dynamics. Moreover, the dataset currently covers only two domains (general knowledge and science) and uses relatively simple citation templates; extending FalseCite to legal, medical, or multi‑citation contexts would improve its ecological validity.

In conclusion, FalseCite offers a novel, citation‑centric benchmark that quantifies how fabricated references amplify factual hallucinations in LLMs. The accompanying internal‑state visualizations provide early evidence of a structured latent trajectory associated with hallucination, opening avenues for future work on detection, mitigation, and model‑architectural interventions.

