Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection

Notice: This research summary and analysis were generated automatically with AI. For authoritative details, please refer to the original arXiv source.

Hallucination detection is critical for deploying large language models (LLMs) in real-world applications. Existing hallucination detection methods achieve strong performance when the training and test data come from the same domain, but they suffer from poor cross-domain generalization. In this paper, we study an important yet overlooked problem, termed generalizable hallucination detection (GHD), which aims to train hallucination detectors on data from a single domain while ensuring robust performance across diverse related domains. In studying GHD, we simulate multi-turn dialogues following LLMs’ initial response and observe an interesting phenomenon: hallucination-initiated multi-turn dialogues universally exhibit larger uncertainty fluctuations than factual ones across different domains. Based on this phenomenon, we propose a new score, SpikeScore, which quantifies abrupt fluctuations in multi-turn dialogues. Through both theoretical analysis and empirical validation, we demonstrate that SpikeScore achieves strong cross-domain separability between hallucinated and non-hallucinated responses. Experiments across multiple LLMs and benchmarks demonstrate that the SpikeScore-based detection method outperforms representative baselines in cross-domain generalization and surpasses advanced generalization-oriented methods, verifying its effectiveness in cross-domain hallucination detection.


💡 Research Summary

The paper tackles the problem of detecting hallucinations generated by large language models (LLMs) when the detector must generalize across domains, a setting the authors call Generalizable Hallucination Detection (GHD). Existing hallucination detectors, whether training‑free (e.g., perplexity‑based) or training‑based (e.g., lightweight classifiers on hidden states), achieve strong results when the training and test data share the same domain, but they degrade sharply on a new domain. GHD formalizes the realistic scenario in which a detector is trained on data from a single source domain (e.g., mathematics) and is expected to work reliably on several related but unseen domains (e.g., commonsense, conversational, or code).

The key empirical observation that drives the work is that multi‑turn dialogues initiated from a hallucinated answer exhibit markedly larger fluctuations in model‑internal uncertainty signals than dialogues that start from a factual answer. To expose this behavior, the authors simulate a continuation dialogue: after the LLM produces an initial answer A₁ to a question Q, they feed A₁ back into the model together with a series of carefully crafted follow‑up prompts (P₂ … P_K, K=20). For each turn k they compute an uncertainty score S(A_k) using a representative training‑based detector, SAPLMA, which attaches a small MLP on top of the LLM’s hidden representation and outputs a probability of hallucination.
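The continuation‑dialogue procedure can be sketched in a few lines. In the outline below, `generate` and `score` are hypothetical stand‑ins for the LLM call and a trained SAPLMA‑style probe (the toy stubs are ours, not the paper's implementation); only the control flow mirrors the described procedure.

```python
from typing import Callable, List

def score_trajectory(
    question: str,
    first_answer: str,
    follow_up_prompts: List[str],
    generate: Callable[[List[str]], str],  # LLM continuation call (stubbed below)
    score: Callable[[str], float],         # uncertainty scorer, e.g. a SAPLMA-style probe
) -> List[float]:
    """Simulate a continuation dialogue from the initial answer A1 and
    record the per-turn uncertainty scores S(A_1), ..., S(A_K)."""
    history = [question, first_answer]
    scores = [score(first_answer)]
    for prompt in follow_up_prompts:       # P_2 ... P_K
        history.append(prompt)
        answer = generate(history)
        history.append(answer)
        scores.append(score(answer))
    return scores

# Toy stand-ins so the sketch runs; a real setup would query the LLM
# and a trained probe instead.
toy_generate = lambda history: "answer-" + str(len(history))
toy_score = lambda answer: (len(answer) % 5) / 5.0

traj = score_trajectory("Q", "A1", ["P%d" % k for k in range(2, 21)], toy_generate, toy_score)
assert len(traj) == 20  # K = 20 turns, matching the paper's setup
```

The output of this loop, one uncertainty score per turn, is exactly the trajectory that SpikeScore is computed over.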

To capture the “spike” in the score trajectory, they define SpikeScore as the maximum absolute second‑order difference of the score sequence:

 SpikeScore(Q, A₁) = max_{1 < k < K} | S(A_{k+1}) – 2·S(A_k) + S(A_{k‑1}) |.

This quantity measures the sharpest curvature (i.e., the most abrupt rise‑and‑fall) in the uncertainty curve. A large SpikeScore indicates a sudden reversal of confidence, which the authors hypothesize correlates with the model’s self‑contradiction and correction mechanisms that are triggered more often when the initial answer is false.
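As a statistic, SpikeScore is simply the largest absolute second‑order difference of the score sequence. A minimal sketch, assuming the trajectory is a plain list of floats:

```python
def spike_score(scores):
    """Maximum absolute second-order difference of a score sequence.

    A large value marks an abrupt rise-and-fall (sharp curvature) in the
    uncertainty trajectory of the simulated dialogue."""
    if len(scores) < 3:
        raise ValueError("need at least three turns to form a second difference")
    return max(
        abs(scores[k + 1] - 2.0 * scores[k] + scores[k - 1])
        for k in range(1, len(scores) - 1)
    )

# A flat trajectory has zero curvature everywhere; a single spike dominates.
assert spike_score([0.2, 0.2, 0.2, 0.2]) == 0.0
assert abs(spike_score([0.1, 0.1, 0.9, 0.1]) - 1.6) < 1e-9
```

The two toy trajectories illustrate the intended behavior: a flat uncertainty curve yields zero, while a sharp rise‑and‑fall at one turn dominates the maximum.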

Theoretical analysis (Theorem 1) shows that, under mild assumptions on the mean and variance of the underlying score distribution, SpikeScore admits a probabilistic lower bound on the separability between hallucinated and non‑hallucinated answer chains. In other words, there exists a threshold τ such that P(SpikeScore > τ | hallucination) is high while P(SpikeScore > τ | factual) is low, and this bound is shown to be largely domain‑invariant.
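In practice, such a bound justifies fitting a single threshold τ on source‑domain data only and reusing it on unseen domains. The sketch below picks τ by maximizing balanced accuracy; that criterion is our illustrative choice, not one specified by the paper, and the data are invented.

```python
def choose_threshold(spikes, labels):
    """Fit tau on source-domain SpikeScores (labels: 1 = hallucinated,
    0 = factual) by maximizing balanced accuracy; the same tau is then
    reused on unseen target domains."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_tau, best_acc = 0.0, -1.0
    for tau in sorted(set(spikes)):
        tpr = sum(s > tau for s, y in zip(spikes, labels) if y == 1) / pos
        tnr = sum(s <= tau for s, y in zip(spikes, labels) if y == 0) / neg
        acc = 0.5 * (tpr + tnr)
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau

# Invented illustration: hallucinated chains carry the larger spikes.
spikes = [1.6, 1.2, 0.9, 0.3, 0.2, 0.1]
labels = [1,   1,   1,   0,   0,   0]
tau = choose_threshold(spikes, labels)
assert tau == 0.3  # perfectly separates the two groups in this toy data
```

A response is then flagged as hallucinated whenever its SpikeScore exceeds the fitted τ.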

Empirically, the authors evaluate four LLMs from the Llama and Qwen families (Llama‑3.2‑3B, Llama‑3.1‑8B, Qwen‑3‑8B, Qwen‑3‑14B) on six benchmark datasets covering a wide range of tasks: TriviaQA, CommonsenseQA, Belebele, CoQA, Math, and SVAMP. For each dataset they train the detector on that domain and test on the remaining five, yielding six cross‑domain configurations. SpikeScore‑based detection consistently outperforms strong baselines, including the original SAPLMA, SEP, PRISM, and ICR‑Probe, with improvements of 5–12 percentage points in AUROC and F1 across most settings. Moreover, combining SpikeScore with the underlying SAPLMA or SEP scores yields further gains, while purely training‑free methods (perplexity, Reasoning Score, In‑Context Sharpness) lag behind.
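The headline AUROC numbers can be computed from per‑example SpikeScores and binary labels without any external library, via the rank‑sum (Mann–Whitney U) formulation; the toy data below are invented for illustration.

```python
def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) formulation: the probability
    that a randomly chosen hallucinated example (label 1) scores higher than
    a randomly chosen factual one (label 0), counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both hallucinated (1) and factual (0) examples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented illustration: hallucinated examples tend to score higher.
scores = [1.6, 1.2, 0.9, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0]
assert abs(auroc(scores, labels) - 8.0 / 9.0) < 1e-9
```

With scores and labels collected per target domain, this single function suffices to reproduce a cross‑domain AUROC table.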

The paper’s contributions can be summarized as follows:

  1. Problem Definition – Formalizes GHD, highlighting the need for domain‑generalizable hallucination detectors.
  2. Phenomenon Discovery – Shows that hallucination‑initiated multi‑turn dialogues produce larger uncertainty spikes than factual ones, a pattern that holds across models, tasks, and scales.
  3. Metric Design – Introduces SpikeScore, a simple yet effective statistic (max second‑order difference) that captures local instability in uncertainty trajectories.
  4. Theoretical Guarantees – Provides a probabilistic separability bound for SpikeScore, supporting its domain‑invariant nature.
  5. Extensive Validation – Demonstrates superior cross‑domain performance on multiple LLMs and six diverse benchmarks, surpassing both existing training‑based and training‑free detectors.

In conclusion, the work offers a novel, theoretically grounded, and empirically validated approach to hallucination detection that remains robust when the test domain differs from the training domain. By leveraging the intrinsic instability of LLMs in multi‑turn interactions, SpikeScore serves as a domain‑agnostic indicator of unreliability, opening avenues for real‑time, cross‑domain safety layers in LLM‑driven applications. Future research may explore integrating SpikeScore with other uncertainty signals (e.g., token‑level entropy, attention variance) or deploying it in live user‑interaction settings to further enhance the trustworthiness of generative AI systems.

