Direct Confidence Alignment: Aligning Verbalized Confidence with Internal Confidence in Large Language Models
Producing trustworthy and reliable Large Language Models (LLMs) has become increasingly important as their use becomes more widespread. Calibration seeks to achieve this by improving the alignment between a model’s confidence and the actual likelihood of its responses being correct or desirable. However, it has been observed that a model’s internal confidence, derived from token probabilities, is not well aligned with its verbalized confidence, which can render results from different calibration methods misleading. In this paper, we propose Direct Confidence Alignment (DCA), a method that uses Direct Preference Optimization to align an LLM’s verbalized confidence with its internal confidence rather than with ground-truth accuracy, enhancing model transparency and reliability by bringing the two confidence measures into closer agreement. We evaluate DCA across multiple open-weight LLMs on a wide range of datasets. To further assess this alignment, we also introduce three new calibration error-based metrics. Our results show that DCA improves alignment metrics on certain model architectures, reducing inconsistencies in a model’s confidence expression. However, we also show that it can be ineffective on others, highlighting the need for more model-aware approaches in the pursuit of interpretable and trustworthy LLMs.
💡 Research Summary
This paper introduces “Direct Confidence Alignment (DCA),” a novel method aimed at improving the transparency and reliability of Large Language Models (LLMs) by aligning their verbalized confidence with their internal confidence. The core problem addressed is the frequent misalignment between an LLM’s “verbalized confidence” (Cv)—the certainty level it states in its output (e.g., “I am 80% confident”)—and its “internal confidence” (Ci)—derived from the token probabilities computed during generation. This inconsistency can mislead users about the model’s true certainty and complicates calibration efforts.
Unlike traditional calibration methods that align a model’s confidence with ground-truth accuracy, DCA uses internal confidence (Ci) as the reference signal. The method leverages Direct Preference Optimization (DPO), an efficient alternative to reinforcement learning from human feedback. The training process involves creating a preference dataset from model generations. For a given question, the model generates a formatted answer containing its guess and a verbalized confidence probability (Cv). The internal confidence (Ci) is extracted from the token probability of the answer choice. A “chosen” response is created by overwriting the original Cv in the answer with the Ci value, while the original response becomes the “rejected” one. DPO is then used to train the model to prefer responses where the verbalized confidence matches the internal confidence.
The authors evaluated DCA across three open-weight, instruction-tuned LLMs: Mistral-7B-Instruct, Gemma-2-9B-Instruct, and Llama-3.2-3B-Instruct. Experiments were conducted on a diverse set of QA datasets: OpenBookQA, TruthfulQA, CosmosQA, and MMLU. To assess alignment, they employed Spearman’s rank correlation coefficient (ρ) and introduced three new metrics based on calibration error (ε = Cv - Ci): the standard deviation of ε (σϵ), the mean absolute ε (|ε|), and the standard error of the mean ε (σM). These metrics provide a multifaceted view of alignment, measuring correlation, deviation magnitude, variability, and estimation stability.
Results were model-dependent. Gemma-2-9B-Instruct showed the most consistent and significant improvements after DCA across all datasets, with increased ρ and substantially decreased |ε|, indicating stronger and more accurate alignment between Cv and Ci. However, the analysis notes that Gemma’s initial confidence distributions were heavily skewed toward high values (90-100%), suggesting DCA may have reinforced an existing bias. In contrast, results for Mistral-7B-Instruct and Llama-3.2-3B-Instruct were mixed. For these models, DCA improved alignment on some datasets but degraded it on others, sometimes severely increasing |ε| or decreasing ρ. This indicates that DCA’s effectiveness is not universal and can be ineffective or even harmful for certain model architectures.
The downstream impact on task accuracy was also examined. While not an explicit goal of DCA, accuracy changes varied: Gemma’s accuracy remained stable, Mistral’s decreased on some datasets (notably TruthfulQA), and Llama’s accuracy increased across the board. This further underscores the model-specific effects of the intervention.
In conclusion, the paper demonstrates that DCA can successfully align verbalized and internal confidence for some LLMs, enhancing transparency. However, its inconsistent results across different models highlight a critical limitation: the method’s efficacy is highly contingent on the underlying model architecture and its inherent confidence expression mechanisms. This finding argues for the development of more model-aware calibration and alignment techniques in the pursuit of truly trustworthy and interpretable AI systems. The paper also acknowledges limitations, including DCA’s reliance on access to model logits and its dependence on the internal confidence being a reasonably meaningful signal in the first place.