Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers
Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers. Experiments across 15 representative calibration methods and four LLM families (7B-72B) reveal that while accuracy increases with answer cardinality, estimated confidence consistently decreases, causing severe miscalibration for questions with mixed answer counts. To address this issue, we propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. SCA achieves state-of-the-art calibration performance under mixed-answer settings while preserving strong calibration on single-answer questions.
💡 Research Summary
This paper investigates confidence calibration for large language models (LLMs) in the under‑explored setting of factual questions that admit multiple correct answers. While prior work on calibration has focused almost exclusively on single‑answer question answering, the authors demonstrate that existing training‑free calibration methods—particularly those based on response consistency—systematically underestimate confidence when several equally valid answers exist. The core problem stems from the assumption that high agreement among sampled outputs signals correctness; in multi‑answer scenarios, correct answers naturally diverge, leading to low estimated confidence even though the model possesses the relevant knowledge.
To study this phenomenon rigorously, the authors introduce MACE (Multi‑Answer Confidence Estimation), a benchmark comprising 12,000 question‑answer pairs across six factual domains (Awards, Political Office, Regional Affiliation, Mathematical Concepts, Rivers, and Language). For each domain, they generate 500 questions with exactly 1, 2, 4, or 6 correct answers, ensuring clear, complete, and verified ground‑truth sets. The construction pipeline involves (1) collecting subject‑relation‑object triples from Wikidata (or rule‑based generation for math), (2) applying popularity and validity filters, (3) manual expert verification (achieving Cohen’s κ = 0.94), and (4) templated natural‑language QA generation. This design isolates the effect of answer cardinality while keeping difficulty and knowledge type constant.
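The final templated-QA step of this pipeline can be sketched as follows. The templates, relation names, and triples below are illustrative stand-ins, not the benchmark's actual data; the real construction also applies the popularity, validity, and expert-verification filters described above.

```python
# Minimal sketch of the templated natural-language QA-generation step:
# turn a verified subject-relation-object set into one benchmark item.
# Template strings and relation keys are hypothetical examples.
TEMPLATES = {
    "river_flows_through": "Name a river that flows through {subject}.",
    "award_won": "Name an award won by {subject}.",
}

def make_qa_item(relation, subject, objects):
    """Build one benchmark item: a question plus its complete,
    verified ground-truth answer set."""
    return {
        "question": TEMPLATES[relation].format(subject=subject),
        "answers": sorted(objects),   # complete ground-truth set
        "cardinality": len(objects),  # 1, 2, 4, or 6 in MACE
    }

item = make_qa_item("river_flows_through", "Germany",
                    {"Rhine", "Danube", "Elbe", "Oder"})
print(item["question"])  # → Name a river that flows through Germany.
```

Fixing the template per relation while varying only the size of the answer set is what lets the benchmark hold difficulty and knowledge type constant across cardinalities.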
The experimental suite evaluates 15 representative calibration methods, spanning three families: (i) probability‑based (token‑level entropy, length‑normalized entropy, semantic entropy), (ii) verbalized confidence (direct confidence prompts, top‑k confidence scores), and (iii) consistency‑based (agreement across multiple generations, optionally weighted by verbalized scores). Methods are further split into single‑turn (confidence derived from the initial generation) and double‑turn (a secondary verification query). The study covers four LLM families—Qwen2.5‑Instruct (7B‑72B), LLaMA3.1‑Instruct (8B‑70B), DeepSeek‑V3, and closed‑source GPT‑4o‑mini/4o—providing a broad view across model scale and architecture.
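To make the failure mode concrete, here is a minimal sketch of a consistency-based estimator of the kind family (iii) describes: sample several responses and take agreement with the majority answer as confidence. The normalization (lowercase exact match) is a simplification of the semantic matching such methods actually use.

```python
from collections import Counter

def consistency_confidence(responses):
    """Consistency-based confidence: the fraction of sampled responses
    that agree with the majority answer (after string normalization)."""
    normalized = [r.strip().lower() for r in responses]
    answer, freq = Counter(normalized).most_common(1)[0]
    return answer, freq / len(normalized)

# Single-answer question: samples agree, so confidence is high.
print(consistency_confidence(["Paris", "paris", "Paris ", "Paris"]))
# → ('paris', 1.0)

# Multi-answer question: every sample may be correct, yet the valid
# answers diverge, so estimated confidence drops.
print(consistency_confidence(["Rhine", "Danube", "Elbe", "Rhine"]))
# → ('rhine', 0.5)
```

The second call illustrates the paper's core observation: disagreement among equally correct responses is indistinguishable, to this estimator, from genuine uncertainty.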
Key findings: (1) As the number of correct answers increases, model accuracy improves (because any correct answer suffices), yet estimated confidence consistently drops across all methods. (2) In realistic mixed‑cardinality settings, methods that achieve state‑of‑the‑art calibration on single‑answer questions collapse, exhibiting severe miscalibration on multi‑answer items. (3) Larger models (e.g., 70B) suffer more pronounced confidence degradation, likely because they generate a wider variety of correct answers, reducing inter‑sample agreement. (4) Consistency‑based approaches, previously the best on single‑answer QA, are the most vulnerable to this effect.
To remedy the systematic under‑confidence, the authors propose Semantic Confidence Aggregation (SCA). Rather than relying on the most confident single response, SCA aggregates the generation probabilities of multiple high‑confidence sampled answers. Concretely, for each sampled response the method computes the full token‑level sequence probability (the product of per‑token softmax probabilities). Responses whose probability falls below a modest threshold contribute negligibly, while all high‑probability answers are summed, yielding an aggregated confidence score that reflects the total probability mass assigned to any correct answer. This simple summation works because low‑confidence (often incorrect) samples have minimal impact, while the probability mass dispersed across several valid answers is captured in full.
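A minimal sketch of this aggregation, under stated assumptions: answers are grouped here by a normalized string key (the paper groups semantically, hence the method's name), and the log-probability values are hypothetical illustrations, not numbers from the paper.

```python
import math

def sequence_probability(token_logprobs):
    """Full sequence probability: the product of per-token
    probabilities, accumulated in log space for numerical stability."""
    return math.exp(sum(token_logprobs))

def sca_confidence(samples):
    """Sum the sequence probabilities of the distinct sampled answers.
    `samples` maps each distinct answer to the per-token
    log-probabilities of one generation of that answer.
    Low-probability (often incorrect) samples add almost nothing,
    while the mass spread across several valid answers is captured."""
    total = sum(sequence_probability(lp) for lp in samples.values())
    return min(total, 1.0)  # clip: distinct answers' mass cannot exceed 1

# Hypothetical log-probs for three sampled answers to a multi-answer
# question; each list holds one answer's per-token log-probabilities.
samples = {
    "rhine":  [-0.2, -0.3],  # p = e^-0.5 ≈ 0.61
    "danube": [-0.7, -0.5],  # p = e^-1.2 ≈ 0.30
    "seine":  [-3.0, -4.0],  # p = e^-7.0 ≈ 0.001, negligible
}
print(round(sca_confidence(samples), 2))  # → 0.91
```

Where a consistency-based score would report roughly 0.5 for this example, summing the mass over the two high-probability answers recovers the model's true certainty that *some* correct answer was produced.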
Empirical results show that SCA attains state‑of‑the‑art calibration under mixed‑answer conditions while preserving strong performance on single‑answer questions. On the 4‑answer (4a) and 6‑answer (6a) subsets, SCA improves AUROC from ~0.55 (baseline consistency methods) to ~0.72–0.73 and reduces Expected Calibration Error (ECE) from ~0.12 to below 0.05. Importantly, these gains are consistent across all four LLM families and scale levels, demonstrating robustness and generality.
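For readers unfamiliar with the ECE metric reported above, a minimal reference implementation of the standard equal-width binning formulation (the paper's exact binning configuration is not specified here):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |accuracy - mean confidence| gap, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # keep c == 1.0 in last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# Over-confident toy set: the model reports 0.9 but is right half the time.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
print(round(ece, 2))  # → 0.4
```

Under this metric, the paper's multi-answer failure mode shows up as the mirror image of the toy example: accuracy exceeds mean confidence, i.e. systematic under-confidence.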
In summary, the paper makes three major contributions: (1) it highlights a critical blind spot in LLM confidence calibration—systematic under‑confidence for multi‑answer questions—backed by a carefully curated benchmark; (2) it provides a comprehensive evaluation of 15 calibration techniques across diverse models, revealing that the best single‑answer methods fail dramatically in realistic mixed‑cardinality scenarios; and (3) it introduces SCA, a lightweight, probability‑aggregation approach that restores calibrated confidence without additional training or model modifications. The work thus advances the reliability of LLMs toward real‑world applications where questions often admit multiple correct answers, and where well‑calibrated confidence estimates are essential for downstream decision‑making, risk assessment, and human‑AI interaction.