Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making
Large language models (LLMs) are increasingly deployed as agents in high-stakes domains where optimal actions depend both on uncertainty about the world and on the utilities of different outcomes, yet their decision logic remains difficult to interpret. We study whether LLMs act as rational utility maximizers with coherent beliefs and stable preferences, examining model behavior on diagnostic challenge problems. By comparing elicited probabilities with observed actions, we relate LLM inferences to ideal Bayesian utility maximization. Our approach provides falsifiable conditions under which the reported probabilities \emph{cannot} correspond to the true beliefs of any rational agent. We apply this methodology to multiple medical diagnostic domains with evaluations across several LLMs, and we discuss implications of the results and directions forward for using LLMs to guide high-stakes decisions.
💡 Research Summary
The paper investigates whether large language models (LLMs) behave as rational agents when faced with high‑stakes decision problems, focusing on medical diagnosis tasks. The authors formalize the decision problem using the classic expected‑utility framework: an unknown world state θ is drawn from a prior, observations x are generated, and a decision maker forms a subjective posterior belief P_S(θ|x). The agent then selects an action a that maximizes the expected utility u(a,θ). Recognizing that real agents (including LLMs) often deviate from perfect utility maximization, the authors incorporate two well‑studied extensions: the random‑utility model (RUM), which adds action‑specific noise ε_a, and the prospect‑theoretic random‑utility model (PT‑RUM), which applies a monotone probability‑weighting function w(·) to the subjective belief before utility evaluation.
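As a concrete illustration of this setup, here is a minimal sketch with a hypothetical 2×2 utility table (the numbers are illustrative, not from the paper). Gumbel noise on each action's utility yields a RUM (logit) choice rule, and a Tversky–Kahneman-style weighting function yields a PT‑RUM variant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical utility table: rows = actions (0 = wait, 1 = treat),
# columns = states (0 = healthy, 1 = diseased). Values are illustrative.
U = np.array([[0.0, -10.0],   # wait: costly if the patient is actually diseased
              [-2.0,  5.0]])  # treat: small cost if healthy, gain if diseased

def w(p, gamma=0.61):
    """Tversky-Kahneman probability weighting (gamma value is illustrative)."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def choose(p_disease, weighting=False, noise_scale=0.0):
    """Pick the action maximizing (possibly weighted, possibly noisy) EU."""
    p = w(p_disease) if weighting else p_disease
    eu = U[:, 0] * (1 - p) + U[:, 1] * p          # expected utility per action
    if noise_scale:                                # RUM: Gumbel noise -> logit choice
        eu = eu + rng.gumbel(scale=noise_scale, size=2)
    return int(np.argmax(eu))
```

With this table the rational threshold for treating is p > 2/17 ≈ 0.118; note that probability weighting overweights small probabilities, so PT‑RUM can flip the choice at low p (e.g., `choose(0.05)` waits while `choose(0.05, weighting=True)` treats).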
The central research question is whether the probabilities that an LLM explicitly reports (denoted P_E(θ|x)) can plausibly be its true subjective beliefs P_S(θ|x) while still producing the observed actions. To answer this, the authors derive two families of testable implications that must hold for any rational decision maker under the above models:
- Conditional Independence (CI) of actions and outcomes given the reported belief. If the reported belief fully captures the agent’s information, the realized state θ should provide no additional predictive power for the action once we condition on P_E. Formally, a ⟂ θ | P_E(θ|x). Violations imply that the model is using hidden information about θ that it does not disclose in its probability estimate. The authors test this using a non‑parametric Conditional Mutual Information (CMI) test and, more concretely, by comparing two predictive models of the action: one that conditions only on the reported probability p = P_E(θ=1|x) (and possibly the evidence vector x) and a second that also conditions on the true state θ. Under CI, adding θ should not improve predictive performance; any statistically significant reduction in log‑loss when θ is added signals a CI violation.
- Monotonicity of choice probabilities. In binary diagnosis settings (θ∈{0,1}), rational agents should exhibit monotone behavior: as the reported probability of disease increases, the probability of choosing a “risky” action (e.g., immediate treatment) should not decrease, and the probability of a “conservative” action (e.g., waiting for more tests) should not increase. This property holds for both RUM and PT‑RUM (the latter allowing for a transformed probability w(p) but preserving monotonicity). The authors test monotonicity using logistic regression and rank‑correlation analyses across the dataset.
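The predictive-modeling version of the CI test can be illustrated on synthetic data. In this sketch (scikit-learn based; all simulation parameters are made up), the simulated agent deliberately leaks θ into its action beyond what its reported probability p conveys, so adding θ to the action predictor reduces held-out log-loss, flagging a CI violation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 4000

# Simulated agent that LEAKS the true state: its action depends on theta
# directly, not only on its reported probability p (a deliberate CI violation).
theta = rng.integers(0, 2, n)
p = np.clip(0.25 + 0.5 * theta + 0.2 * rng.normal(size=n), 0.01, 0.99)
logit_a = 4 * (p - 0.5) + 2.0 * (theta - 0.5)      # hidden use of theta
a = (rng.random(n) < 1 / (1 + np.exp(-logit_a))).astype(int)

X0 = p.reshape(-1, 1)                # conditions on reported probability only
X1 = np.column_stack([p, theta])     # also conditions on the true state
tr, te = train_test_split(np.arange(n), test_size=0.5, random_state=0)

m0 = LogisticRegression().fit(X0[tr], a[tr])
m1 = LogisticRegression().fit(X1[tr], a[tr])
ll0 = log_loss(a[te], m0.predict_proba(X0[te])[:, 1])
ll1 = log_loss(a[te], m1.predict_proba(X1[te])[:, 1])
print(f"log-loss, p only: {ll0:.3f}   p + theta: {ll1:.3f}")
# Under CI the gap should be ~0; here adding theta helps, signaling a violation.
```

In practice one would add a bootstrap or permutation procedure over the held-out log-loss gap to obtain confidence intervals, as the paper's CMI-style testing does more formally.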
Experimental Design
The authors evaluate four medical diagnostic domains (e.g., pneumonia, myocardial infarction, diabetic complications, stroke) using simulated patient data where the true disease state θ is known. For each case x, they query several state‑of‑the‑art LLMs (GPT‑4, Claude, Llama 2, Gemini) in two separate prompts: (i) a probability elicitation prompt asking for a numeric estimate of disease likelihood, and (ii) an action prompt asking which clinical step the model would take (order a test, prescribe treatment, defer, etc.). This yields a dataset of tuples (x, p = P_E(θ=1|x), a, θ).
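A minimal sketch of this two-prompt protocol, assuming a generic `query_llm` callable (a stand-in for any chat-completion client) and illustrative prompt wording that is not the paper's actual phrasing:

```python
import re

def elicit_probability(query_llm, case_description):
    # Prompt (i): numeric probability elicitation. Wording is illustrative.
    reply = query_llm(
        f"Patient presentation: {case_description}\n"
        "On a scale from 0 to 1, what is the probability the patient has the "
        "disease? Answer with a single number."
    )
    match = re.search(r"\d*\.?\d+", reply)
    return float(match.group()) if match else None

def elicit_action(query_llm, case_description, actions=("defer", "test", "treat")):
    # Prompt (ii): action choice, issued independently of the probability prompt.
    reply = query_llm(
        f"Patient presentation: {case_description}\n"
        f"Choose exactly one next clinical step from {list(actions)}."
    ).lower()
    return next((act for act in actions if act in reply), None)

def collect(query_llm, cases):
    """Build the (x, p, a, theta) tuples used in the analysis."""
    return [
        (x, elicit_probability(query_llm, x), elicit_action(query_llm, x), theta)
        for x, theta in cases
    ]
```

Keeping the two prompts separate matters for the analysis: the elicited p and the chosen a must come from independent queries so that CI and monotonicity violations reflect the model's behavior rather than self-consistency within a single response.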
Results
Across models, the CI test reveals systematic violations. Adding the true state θ to the action predictor reduces out‑of‑sample log‑loss by 5–12 % on average, with bootstrap confidence intervals excluding zero, indicating that the reported probabilities are not sufficient statistics for the decision. The effect persists even after conditioning on the full evidence vector x, suggesting that the models retain hidden information about θ that they do not verbalize.
Monotonicity tests show mixed outcomes. GPT‑4 largely respects monotonicity, with only a small fraction of cases (≈5 %) showing inverse ordering. In contrast, Llama 2 and Claude frequently exhibit non‑monotonic patterns, especially in complex symptom combinations where the model sometimes recommends aggressive treatment despite a lower reported disease probability. This indicates that PT‑RUM assumptions (which allow probability weighting) are insufficient to explain the observed behavior for these models.
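A minimal version of the monotonicity check (rank correlation between reported probability and the risky action, plus binned choice rates) might look as follows, on synthetic data from a monotone agent; all numbers are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)

# Synthetic (p, a) pairs: a monotone agent treats more often as p rises.
p = rng.uniform(0, 1, 2000)
a = (rng.random(2000) < 1 / (1 + np.exp(-6 * (p - 0.4)))).astype(int)

# Bin reported probabilities and check the treat rate is non-decreasing.
bins = np.linspace(0, 1, 11)
idx = np.digitize(p, bins) - 1
rates = np.array([a[idx == b].mean() for b in range(10)])
violations = int(np.sum(np.diff(rates) < -0.05))  # tolerance for sampling noise

rho, pval = spearmanr(p, a)
print(f"Spearman rho = {rho:.2f}, bins with treat-rate drops: {violations}")
```

For a monotone agent rho is strongly positive and essentially no bin-to-bin drops appear; the non-monotonic patterns reported for Llama 2 and Claude would show up here as a weak or negative rho and frequent rate drops.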
Interpretation and Implications
The authors argue that the observed CI violations likely stem from a mismatch between the model’s internal representation of uncertainty (e.g., token‑level logits) and the surface‑level natural‑language probability it is asked to produce. Fine‑tuning data, prompt phrasing, and the model’s training objective may bias the model toward “plausible‑sounding” probabilities rather than faithful reports of its internal belief state. Consequently, using LLM‑generated probabilities as inputs to downstream decision‑theoretic pipelines (e.g., triage systems) may be unsafe unless the belief–action consistency is verified.
The paper contributes a novel, fully black‑box methodology for assessing belief coherence that complements existing calibration and discrimination metrics. By focusing on decision‑linked statistical constraints, the approach can be applied to closed‑source models where internal activations are unavailable.
Future Directions
The authors propose several extensions: (1) integrating mechanistic interpretability tools (e.g., probing hidden states) to directly compare internal logits with elicited probabilities, (2) estimating model‑specific utility functions or probability‑weighting curves from observed behavior to better characterize risk attitudes, and (3) conducting human‑LLM comparative studies in real clinical settings to assess whether LLMs can achieve belief–action coherence comparable to expert physicians.
Overall, the study demonstrates that while some LLMs (notably GPT‑4) approach rational‑agent behavior in certain medical tasks, most current models fail to satisfy fundamental coherence conditions, highlighting the need for rigorous validation before deploying LLMs in high‑stakes decision environments.