The Illusion of Certainty: Uncertainty Quantification for LLMs Fails under Ambiguity
Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different estimation paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon can be theoretically explained, revealing that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity. Overall, our study reveals a key shortcoming of current UQ methods for LLMs and motivates a rethinking of current modeling paradigms.
💡 Research Summary
The paper “The Illusion of Certainty: Uncertainty Quantification for LLMs Fails under Ambiguity” investigates a critical gap in current uncertainty quantification (UQ) methods for large language models (LLMs). While LLMs are increasingly deployed in high‑stakes domains, existing UQ techniques have been evaluated almost exclusively on tasks where each question has a single correct answer, i.e., zero aleatoric (intrinsic) uncertainty. The authors ask whether these methods truly capture epistemic (model) uncertainty or merely benefit from the absence of answer ambiguity.
To answer this, they first formalize total uncertainty as the cross-entropy between the true answer distribution p* (over semantically equivalent answer classes) and the model's predicted distribution p. This decomposes into aleatoric uncertainty (the entropy H(p*)) and epistemic uncertainty (the KL divergence KL(p* ‖ p)). They then analyze three major families of UQ approaches: (i) consistency‑based methods that use variation in the model's output (e.g., predictive entropy, temperature scaling), (ii) ensemble‑based methods that estimate mutual information between model parameters and predictions, and (iii) internal‑representation probes.
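The decomposition above can be checked numerically. The following is a minimal sketch (the distributions are made up for illustration, not taken from the paper): total uncertainty, computed as cross-entropy, splits exactly into the entropy of p* plus the KL divergence from p* to p.

```python
import numpy as np

def cross_entropy(p_true, p_model):
    """Total uncertainty: H(p*, p) = H(p*) + KL(p* || p)."""
    p_model = np.clip(p_model, 1e-12, 1.0)
    return -np.sum(p_true * np.log(p_model))

def entropy(p):
    """Aleatoric component: H(p*)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p_true, p_model):
    """Epistemic component: KL(p* || p)."""
    p_model = np.clip(p_model, 1e-12, 1.0)
    m = p_true > 0
    return np.sum(p_true[m] * np.log(p_true[m] / p_model[m]))

# Hypothetical ambiguous question with two valid answer clusters.
p_star = np.array([0.6, 0.4, 0.0])   # true answer distribution
p_hat  = np.array([0.5, 0.3, 0.2])   # model's predictive distribution

total = cross_entropy(p_star, p_hat)
aleatoric = entropy(p_star)
epistemic = kl_divergence(p_star, p_hat)
assert np.isclose(total, aleatoric + epistemic)
```

Because p* is not one-hot here, the aleatoric term is strictly positive, which is exactly the regime the paper argues existing benchmarks exclude.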
Theoretical contributions show that when aleatoric uncertainty is zero (i.e., p* is a one‑hot vector), predictive entropy and mutual information are tightly linked to epistemic uncertainty. Theorem 3.1 proves that high predictive entropy necessarily implies high epistemic uncertainty, while Theorem 3.2 provides a probabilistic bound showing that low entropy usually corresponds to low epistemic uncertainty for well‑trained models. Consequently, under the zero‑aleatoric assumption, existing UQ proxies are guaranteed to correlate with true epistemic error, explaining their strong empirical performance in prior work.
However, the authors demonstrate that this guarantee collapses once H(p*) > 0. When multiple answers are valid, the true distribution lies inside the probability simplex, and high entropy may stem from genuine answer ambiguity rather than model ignorance. In this regime, predictive entropy and ensemble mutual information no longer provide reliable signals about epistemic uncertainty; they conflate aleatoric and epistemic components, leading to "the illusion of certainty."
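The conflation can be made concrete with a two-case toy example (constructed here for illustration, not drawn from the paper): the model emits the same predictive distribution, and hence the same predictive entropy, in both cases, yet its epistemic uncertainty differs completely depending on whether the question is ambiguous.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a distribution (nats)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p_true, p_model):
    """Epistemic uncertainty: KL(p* || p)."""
    p_model = np.clip(p_model, 1e-12, 1.0)
    m = p_true > 0
    return np.sum(p_true[m] * np.log(p_true[m] / p_model[m]))

# The model outputs the same predictive distribution in both cases,
# so the predictive-entropy UQ proxy cannot tell them apart.
p_model = np.array([0.5, 0.5])

# Case A: genuinely ambiguous question; the model matches p*.
p_star_a = np.array([0.5, 0.5])
# Case B: unambiguous question; the model is simply ignorant.
p_star_b = np.array([1.0, 0.0])

print(entropy(p_model))       # identical proxy value in both cases
print(kl(p_star_a, p_model))  # epistemic ≈ 0: the model is well calibrated
print(kl(p_star_b, p_model))  # epistemic = log 2: the model is wrong
```

A proxy that assigns the same score to both cases cannot, even in principle, rank epistemic uncertainty correctly, which is the intuition behind the degradation the paper reports.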
To empirically validate the theory, the paper introduces two novel datasets: MAQA* and AmbigQA*. Both consist of real‑world questions that admit multiple correct answers. Ground‑truth answer distributions are estimated from factual co‑occurrence statistics on large web corpora, and answers are clustered into semantic equivalence classes. These datasets enable a principled evaluation of UQ methods under measurable ambiguity.
Experiments span several state‑of‑the‑art LLMs (GPT‑3.5, LLaMA‑2, Claude‑2) and a suite of recent UQ techniques, including temperature‑scaled softmax, Monte‑Carlo dropout, Bayesian ensembles, and hidden‑state variance measures. On standard unambiguous benchmarks, the methods achieve AUROC scores around 0.85–0.90 for distinguishing correct from incorrect predictions. On MAQA* and AmbigQA*, performance collapses to near‑random levels (AUROC ≈ 0.55–0.60). Predictive entropy becomes indistinguishable from a uniform baseline, and ensemble mutual information shows high variance without predictive power. The degradation is consistent across all families, confirming the theoretical claim that current UQ estimators are fundamentally limited in the presence of aleatoric uncertainty.
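The AUROC numbers above measure how well an uncertainty score ranks incorrect predictions above correct ones. A minimal rank-based implementation, with synthetic entropy scores standing in for real model outputs (the separations below are invented for illustration), shows how overlap between the two score distributions drives AUROC toward the 0.5 chance level:

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen incorrect prediction scores higher than a randomly
    chosen correct one, with ties counted as 0.5."""
    pos = scores[labels == 1]   # incorrect predictions
    neg = scores[labels == 0]   # correct predictions
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(200), np.zeros(200)])

# Unambiguous regime: errors carry clearly higher entropy.
sep = np.concatenate([rng.normal(2.0, 0.5, 200),
                      rng.normal(0.5, 0.5, 200)])
# Ambiguous regime: entropy also rises on correct answers,
# so the two score distributions largely overlap.
mixed = np.concatenate([rng.normal(2.0, 0.5, 200),
                        rng.normal(1.8, 0.5, 200)])

print(auroc(sep, labels))    # well above 0.5
print(auroc(mixed, labels))  # close to 0.5
```

This mirrors the qualitative pattern in the paper's results: once ambiguity inflates entropy on correct answers too, the ranking signal largely disappears.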
The paper concludes that existing UQ paradigms are inadvertently tuned to an unrealistic "single‑answer" world and should not be trusted in real deployments where ambiguity is the norm. It outlines three research directions: (1) explicitly modeling aleatoric uncertainty by incorporating true answer distributions into the training objective, (2) designing loss functions that directly minimize KL divergence to p* rather than maximizing pointwise likelihood, and (3) developing interactive systems that can request clarification or leverage external knowledge when faced with high ambiguity. By releasing MAQA* and AmbigQA*, the authors also provide a benchmark for future work aiming to build uncertainty estimators that genuinely separate epistemic and aleatoric components.
In summary, the study provides both a rigorous theoretical explanation and compelling empirical evidence that current uncertainty quantification methods for LLMs fail under realistic ambiguous conditions, urging a paradigm shift toward uncertainty‑aware modeling and evaluation.