Knowing What You Know Is Not Enough: Large Language Model Confidences Don't Align With Their Actions

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models (LLMs) are increasingly deployed in agentic and multi-turn workflows where they are tasked to perform actions of significant consequence. In order to deploy them reliably and manage risky outcomes in these settings, it is helpful to access model uncertainty estimates. However, confidence elicitation methods for LLMs are typically not evaluated directly in agentic settings; instead, they are evaluated on static datasets, such as Q&A benchmarks. In this work we investigate the relationship between confidence estimates elicited in static settings and the behavior of LLMs in interactive settings. We uncover a significant action-belief gap – LLMs frequently take actions that contradict their elicited confidences. In a prediction market setting, we find that models often bet against their own high-confidence predictions; in a tool-use setting, models fail to reliably invoke information-seeking tools when their internal confidence is low; and in a user-challenge setting, models change their answers when they have high confidence in them, whilst sticking to answers they have low confidence in. Crucially, we show that static calibration is an insufficient predictor of consistency in the above dynamic settings, as stronger, better calibrated models are sometimes less consistent than their smaller and weaker open-source counterparts. Our results highlight a critical blind spot in current evaluation methodologies: ensuring that a model knows what it knows does not guarantee that it will act rationally on that knowledge.


💡 Research Summary

The paper investigates a critical mismatch between large language models’ (LLMs) self‑reported confidence and the actions they actually take in agentic settings. While many recent works focus on extracting confidence estimates from LLMs and evaluating them on static question‑answering benchmarks using calibration metrics such as Expected Calibration Error (ECE), the authors argue that such static evaluation does not guarantee rational behavior when the model is deployed as an autonomous agent that must make decisions, invoke tools, or respond to user feedback.
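The static calibration metric referenced above, Expected Calibration Error (ECE), can be sketched in a few lines: predictions are bucketed by reported confidence, and ECE is the size-weighted average gap between each bucket's mean confidence and its empirical accuracy. This is a minimal illustrative version; the equal-width binning and bin count are conventional defaults, not necessarily those used in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins.

    confidences: per-prediction confidence in [0, 1]
    correct:     per-prediction booleans (was the answer right?)
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        # weight each bin's |confidence - accuracy| gap by its share of data
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece
```

A well-calibrated model drives this gap toward zero; the paper's point is that a near-zero ECE still says nothing about whether the model's subsequent actions respect those confidences.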

To study this “action‑belief gap,” the authors design three simple yet realistic experimental scenarios that mirror common real‑world uses of LLMs:

  1. Utility‑maximization / Prediction market – The model first estimates the probability of a future event (e.g., a company’s revenue exceeding a threshold) and then is asked to place a bet on a market with known odds, aiming to maximize either linear or logarithmic utility. The optimal bet is analytically derived from the model’s own probability estimate. Across seven models (three open‑source: Llama 3.1 8B, Gemma 2 9B, Mistral Small Instruct; four closed‑source: GPT‑4o, GPT‑4o Mini, Gemini 2.5 Pro, Gemini 2.5 Flash) and three confidence‑elicitation methods (logit‑based, sampling, verbal), the authors find that most models place bets far from the optimal amount and often on the opposite side of the market relative to their expressed confidence. Directional consistency never exceeds 79 % and many strong models (e.g., GPT‑4 series) are inconsistent the majority of the time.

  2. Tool‑use – The model receives a factual question from the “no‑context” subset of TriviaQA, reports an answer and its confidence, and is then given the option to call a reliable search tool if it is unsure. The authors measure monotonicity between the rate of tool calls and the reported confidence using Spearman’s ρ. While most models show a positive correlation, none approach perfect (+1) monotonicity; some pairs (e.g., Mistral with verbal confidence, Llama with logit confidence) are essentially uncorrelated. This demonstrates that low confidence does not reliably trigger tool invocation, nor does high confidence suppress it.

  3. User interaction / Challenge – After providing an answer, the model is confronted with a user challenge (“Your answer is incorrect”). The model either sticks to its original answer or defers to a new one. The authors again compute the monotonicity of “sticking rate” versus confidence. Consistent behavior would mean higher sticking rates for high‑confidence answers and higher deferral rates for low‑confidence answers. The results reveal the opposite in many cases: models sometimes change high‑confidence answers while stubbornly defending low‑confidence ones.
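For the prediction-market setting, the analytically optimal bet under logarithmic utility is the standard Kelly criterion, which can be sketched as follows. This is an illustrative reconstruction under common assumptions (a binary contract priced at q that pays 1 if the event occurs, with p the model's own probability estimate); the function name and interface are hypothetical, and the paper's exact market formulation may differ.

```python
def optimal_log_utility_bet(p, q):
    """Kelly-optimal (side, fraction of wealth) for a binary contract.

    p: the bettor's probability that the event occurs
    q: market price of a contract paying 1 if the event occurs
    """
    if p > q:
        # YES looks underpriced: back the event with the Kelly fraction
        return "YES", (p - q) / (1 - q)
    if p < q:
        # YES looks overpriced: bet against the event
        return "NO", (q - p) / q
    return "NONE", 0.0  # price equals belief: no edge, no bet
```

A directionally consistent agent would at minimum bet on the side this rule picks from its own elicited probability; the summary's finding is that models frequently take the opposite side.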
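The monotonicity measure used in the tool-use and user-challenge settings, Spearman's ρ, can be sketched as a rank correlation between per-bin confidence levels and the corresponding action rate (tool-call rate or sticking rate). This pure-Python version, with average ranks for ties, is illustrative only and not the authors' implementation.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    def ranks(vals):
        # assign 1-based ranks, averaging ranks across tied values
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg_rank
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Perfectly consistent behavior corresponds to |ρ| = 1 between confidence and the action rate; the summary reports that no model-elicitation pair approaches this, and some are essentially uncorrelated.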

Across all experiments, the authors observe that static calibration quality does not predict action‑belief consistency. In fact, well‑calibrated closed‑source models (e.g., Gemini 2.5 Pro) can be less consistent than smaller, weaker open‑source models. This suggests that current evaluation pipelines, which focus solely on “what the model knows,” overlook an orthogonal capability: “how the model uses what it knows.”

The paper’s contributions are threefold: (1) introduction of three minimal yet representative agentic tasks to probe rational behavior; (2) extensive empirical evaluation across multiple model families, sizes, and confidence‑elicitation techniques, consistently revealing the action‑belief gap; (3) analysis showing that neither task performance nor static calibration fully explains the gap, highlighting the need for new evaluation metrics that jointly assess belief and action.

Implications are significant for the deployment of LLMs in high‑stakes domains. Practitioners cannot rely on static confidence estimates to trigger safety mechanisms (e.g., tool calls, human hand‑off). Future work should explore training objectives that directly align confidence with downstream actions, incorporate reinforcement learning with human feedback that penalizes action‑belief inconsistencies, and develop meta‑cognitive architectures that maintain a coherent internal belief state guiding rational decision‑making. By addressing this blind spot, the community can move toward truly trustworthy, agentic LLM systems.

