Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv paper.

Large Language Models (LLMs) are commonly used in Question Answering (QA) settings, increasingly in the natural sciences and beyond. Reliable Uncertainty Quantification (UQ) is critical for the trustworthy uptake of generated answers, yet existing UQ approaches remain weakly validated in scientific QA, a domain that demands both fact retrieval and multi-step reasoning. We introduce the first large-scale benchmark for evaluating the calibration of UQ methods in reasoning-demanding QA, together with an extensible open-source framework for reproducible assessment. Our study spans up to 20 large language models across base, instruction-tuned, and reasoning variants. Our analysis covers seven scientific QA datasets, including both multiple-choice and arithmetic question answering tasks, using prompting to emulate an open question answering setting. We evaluate and compare methods representative of prominent UQ approaches on a total of 685,000 long-form responses, spanning reasoning complexities representative of domain-specific tasks. At the token level, we find that instruction tuning induces strong probability mass polarization, reducing the reliability of token-level confidences as estimates of uncertainty. Models further fine-tuned for reasoning are subject to the same effect, although the reasoning process appears to mitigate it to a degree that depends on the provider. At the sequence level, we show that verbalized approaches are systematically biased and poorly correlated with correctness, whereas answer frequency (consistency across samples) yields the most reliable calibration. Finally, we show how relying on Expected Calibration Error (ECE) as the sole measure of UQ performance on benchmark datasets can be misleading. Our findings expose critical limitations of current UQ methods for LLMs and of standard practices for benchmarking them.


💡 Research Summary

The paper presents the first large‑scale benchmark for evaluating uncertainty quantification (UQ) methods in long‑form scientific question answering (QA) with large language models (LLMs). Recognizing that reliable uncertainty estimates are essential for trustworthy scientific AI—especially to mitigate hallucinations—the authors systematically assess how well several prominent UQ approaches are calibrated when models must retrieve facts and perform multi‑step reasoning.

Scope and Dataset
The study spans up to 20 LLMs, covering base, instruction‑tuned, and reasoning‑fine‑tuned variants from multiple providers. Seven scientific QA datasets are used, including four multiple‑choice (physics, chemistry, biology) and three arithmetic/math datasets, each designed to allow verifiable ground truth. In total, 685,000 long‑form responses are generated using prompting strategies that emulate open‑ended QA (e.g., APriCoT).

Research Questions

  1. To what extent are token‑level probabilities calibrated, and how do instruction‑tuning or reasoning fine‑tuning affect this calibration?
  2. How reliable are sequence‑level UQ methods (verbalized uncertainty, P(True), and answer‑consistency) for long‑form scientific answers?

Token‑Level Findings

  • Base models exhibit relatively smooth probability distributions and modest Expected Calibration Error (ECE).
  • Instruction‑tuned models display severe probability mass polarization: the softmax collapses onto a single token, inflating confidence scores irrespective of correctness. This dramatically worsens calibration.
  • Reasoning‑fine‑tuned models, which emit explicit chain‑of‑thought traces before answering, sometimes mitigate polarization, but the effect is provider‑dependent and not universal.
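
The polarization effect described above can be illustrated with a small softmax sketch. The logits below are hypothetical, not drawn from the paper; sharpening them mimics how instruction tuning concentrates probability mass onto one token:

```python
import numpy as np

def softmax(logits):
    """Convert a vector of logits into a probability distribution."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical next-token logits for the same context.
base_logits = np.array([2.0, 1.5, 1.0, 0.5, 0.0])

# A base model: relatively smooth distribution over candidate tokens.
p_base = softmax(base_logits)
# An instruction-tuned model, emulated by sharpening the same logits:
# probability mass collapses onto a single token.
p_tuned = softmax(base_logits * 4.0)

print(p_base.max())   # ≈ 0.43: moderate top-token confidence
print(p_tuned.max())  # ≈ 0.86: near-collapse onto one token
```

With such polarized distributions, the top-token probability is high regardless of whether the answer is correct, which is why raw token confidences become unreliable as uncertainty estimates.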

Sequence‑Level Findings
Three families of methods are compared:

  1. Verbalized Uncertainty – prompting the model to state its confidence. Results show systematic bias; models often produce over‑confident linguistic cues that correlate poorly with actual accuracy.
  2. P(True) / probability aggregation – asking the model to judge whether its own answer is true and reading off the probability of the "True" token, or aggregating token‑level probabilities across the generated answer. Because token‑level polarization propagates into these scores, they inherit the same calibration problems.
  3. Answer Consistency (frequency across samples) – generating multiple samples per question and using the relative frequency of the most common answer as a confidence proxy. This approach consistently yields the lowest ECE and the highest correlation with correctness across all datasets, especially on multi‑step reasoning tasks where uncertainty compounds.
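
The consistency-based method can be sketched in a few lines. This is a minimal illustration with made-up sampled answers, not the authors' implementation; in practice each sample would be a final answer extracted from an independent model generation:

```python
from collections import Counter

def consistency_confidence(samples):
    """Return the majority answer and its empirical frequency
    across sampled generations for the same question."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Ten hypothetical sampled answers to one arithmetic question.
samples = ["42", "42", "42", "41", "42", "42", "40", "42", "42", "42"]
answer, conf = consistency_confidence(samples)
print(answer, conf)  # "42" with confidence 0.8
```

The intuition is that sampling variability directly reflects the model's uncertainty: on multi-step problems where errors compound, disagreement across samples rises and the confidence proxy drops accordingly.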

Critique of ECE‑Only Evaluation
The authors demonstrate that relying solely on ECE can be misleading: models with extreme polarization may achieve deceptively low ECE after binning, while their raw confidence scores remain uninformative. They advocate for complementary diagnostics such as reliability diagrams, Brier scores, and calibration plots.

Framework and Reproducibility
An extensible open‑source benchmarking framework is released, allowing researchers to plug in new models, datasets, prompts, or UQ methods with minimal effort. All raw uncertainty scores, scripts, and visualizations are provided for full reproducibility.

Conclusions

  • Instruction tuning harms token‑level calibration by concentrating probability mass, reducing the usefulness of raw token confidences.
  • Reasoning fine‑tuning can partially offset this effect but does not guarantee calibrated outputs.
  • Verbalized uncertainty and simple probability aggregation are systematically biased and unreliable for scientific long‑form QA.
  • Consistency‑based metrics, which capture variability across sampled generations, emerge as the most robust calibration signal.
  • Evaluations should move beyond a single scalar like ECE to a richer suite of calibration diagnostics.

Overall, the work uncovers critical limitations of current UQ practices for LLMs in high‑stakes scientific domains and provides a solid, reproducible benchmark to guide future research toward more trustworthy uncertainty estimation.

