Reasoning aligns language models to human cognition
Do language models make decisions under uncertainty like humans do, and what role does chain-of-thought (CoT) reasoning play in the underlying decision process? We introduce an active probabilistic reasoning task that cleanly separates sampling (actively acquiring evidence) from inference (integrating evidence toward a decision). Benchmarking humans and a broad set of contemporary large language models against near-optimal reference policies reveals a consistent pattern: extended reasoning is the key determinant of strong performance, driving large gains in inference and producing belief trajectories that become strikingly human-like, while yielding only modest improvements in active sampling. To explain these differences, we fit a mechanistic model that captures systematic deviations from optimal behavior via four interpretable latent variables: memory, strategy, choice bias, and occlusion awareness. This model places humans and models in a shared low-dimensional cognitive space, reproduces behavioral signatures across agents, and shows how chain-of-thought shifts language models toward human-like regimes of evidence accumulation and belief-to-choice mapping, tightening alignment in inference while leaving a persistent gap in information acquisition.
💡 Research Summary
This paper asks two fundamental questions: do large language models (LLMs) make decisions under uncertainty in the same way humans do, and what role does chain‑of‑thought (CoT) reasoning play in that process? To answer these questions, the authors design an “active probabilistic reasoning” task that cleanly separates two core components of decision‑making: sampling (actively acquiring evidence) and inference (integrating evidence to reach a final decision).
In each trial, four buttons (A‑D) are presented, one of which is biased toward a RED outcome (probability 0.9) while the others are unbiased (probability 0.5). Over a variable number of rounds (N = 2…15), the agent selects a button, observes a binary outcome (RED/GREEN), and proceeds to the next round. Crucially, on each round between zero and three buttons may be occluded, forcing the agent to choose from the buttons that remain available. After the sampling phase, a single inference round requires the agent to commit to the button it believes is biased. The task therefore isolates evidence acquisition from the reward‑maximizing aspect of classic multi‑armed bandits; performance depends solely on the quality of the final inference.
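The task mechanics can be sketched in a few lines of Python. Function names here are illustrative, and the uniform distribution over how many buttons are occluded per round is my assumption; the summary only states that 0‑3 buttons may be hidden:

```python
import random

RED_BIASED, RED_FAIR = 0.9, 0.5  # outcome probabilities from the task description

def play_round(biased_button, chosen_button, rng=random):
    """Sample a RED/GREEN outcome for one button press."""
    p_red = RED_BIASED if chosen_button == biased_button else RED_FAIR
    return "RED" if rng.random() < p_red else "GREEN"

def make_trial(n_rounds, max_occluded=3, rng=random):
    """Generate one trial: a hidden biased button and, per round,
    the list of buttons still available after occlusion."""
    buttons = ["A", "B", "C", "D"]
    biased = rng.choice(buttons)
    rounds = []
    for _ in range(n_rounds):
        n_occ = rng.randint(0, max_occluded)      # assumption: occlusion count uniform in 0-3
        occluded = set(rng.sample(buttons, n_occ))
        rounds.append(sorted(set(buttons) - occluded))
    return biased, rounds
```

A sampling agent would then pick one button from each round's available list, observe `play_round(...)`, and finally name the button it believes is biased.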
Human participants (50 recruited, 4,600 games total) performed the task via a graphical interface. LLMs were given an equivalent text‑based version, with the same instructions and constraints, and were evaluated under two reasoning conditions: a low‑effort baseline and an “Extended Reasoning” condition that encourages longer CoT token sequences. The model suite spans the current landscape: OpenAI’s GPT‑4o mini, GPT‑5 mini, various GPT‑OSS sizes, LLaMA variants (including behavior‑fine‑tuned versions), DeepSeek, Anthropic Claude, Google Gemini and Gemma, Qwen (dense and Mixture‑of‑Experts), as well as open‑source Apertus, xAI Grok, and GLM 4.5. In total more than 55,000 games were collected across models.
Performance is quantified along three axes. (1) Overall success rate (percentage of trials where the biased button is correctly identified). (2) Sampling quality, measured as the loss relative to an optimal reference agent (PPO sampling + MAP inference) after accounting for the evidence actually gathered. (3) Inference quality, measured as the loss of the same agent when it applies a MAP decision rule to the evidence it collected, thereby isolating the integration step. Additionally, the frequency of “invalid choices” (selecting occluded buttons, producing tokens outside A‑D, or failing to respond) is reported.
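For intuition, the MAP decision rule used in the inference baseline reduces, under a uniform prior over the four hypotheses, to maximizing the likelihood of the observed (button, outcome) pairs under each "button h is biased" hypothesis. A minimal sketch (the function name is mine):

```python
def map_inference(observations, buttons="ABCD"):
    """MAP choice under a uniform prior: for each hypothesis
    'button h is biased', multiply the likelihood of every
    (button, outcome) pair and return the best hypothesis.
    observations: list of (button, "RED" | "GREEN") tuples."""
    def likelihood(h):
        l = 1.0
        for b, outcome in observations:
            p_red = 0.9 if b == h else 0.5  # biased vs. fair button
            l *= p_red if outcome == "RED" else 1.0 - p_red
        return l
    return max(buttons, key=likelihood)
```

Comparing an agent's final answer against `map_inference` applied to the very evidence that agent gathered is what isolates inference quality from sampling quality.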
Results show a consistent pattern. Extended CoT reasoning dramatically reduces inference loss for virtually all models, often bringing inference quality to or above the human median. However, sampling loss improves only modestly; even the best‑performing LLMs still fall short of the optimal sampling policy. Invalid choices are more common during the sampling phase than at the final decision, and CoT reduces their occurrence. In other words, chain‑of‑thought primarily helps models better integrate the evidence they have, but does not substantially change how they acquire that evidence.
To move beyond descriptive metrics, the authors fit a mechanistic model with four interpretable latent variables that capture systematic deviations from optimal Bayesian behavior:
- Memory (β) – how well past observations are retained in the belief state.
- Strategy (κ) – the stochasticity of button selection, reflecting exploration versus exploitation tendencies.
- Choice Bias (ω) – a systematic preference for particular buttons independent of evidence.
- Occlusion Awareness (θ) – the ability to avoid selecting occluded options.
The model assumes a uniform prior over the four hypotheses and updates a posterior vector via Bayes’ rule after each observation. By fitting β, κ, ω, and θ to each participant and each model, the authors embed humans and LLMs in a shared low‑dimensional cognitive space. Human participants cluster with high memory, moderate strategy, low bias, and strong occlusion awareness. LLMs, especially in the low‑effort condition, exhibit low memory (rapid forgetting of earlier samples) and suboptimal strategy (either overly random or overly deterministic sampling). When CoT is enabled, β and κ shift toward the human cluster, indicating that extended reasoning improves evidence retention and leads to more human‑like sampling patterns. Nevertheless, θ and ω remain distinct: models still show occasional bias toward certain button labels and are less adept at recognizing occluded options.
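One plausible parameterization of such a latent-variable model can be sketched as follows. The paper's exact likelihood and choice rules are not given in this summary, so the specific forms below, multiplicative decay β on accumulated log-evidence, a softmax with inverse temperature κ, and an additive per-button bias ω, are assumptions; θ, which would govern masking of occluded options, is omitted for brevity:

```python
import math
import random

def update_belief(log_belief, button, outcome, beta=0.9):
    """Leaky Bayesian update: accumulated log-evidence decays by the
    memory parameter beta before the new observation is added.
    beta = 1 recovers exact Bayes; beta < 1 forgets earlier samples."""
    new = {}
    for h in log_belief:                          # h: hypothesis 'button h is biased'
        p_red = 0.9 if h == button else 0.5
        ll = math.log(p_red if outcome == "RED" else 1.0 - p_red)
        new[h] = beta * log_belief[h] + ll
    return new

def choose(log_belief, available, kappa=1.0, bias=None):
    """Softmax choice over available buttons: kappa sets how
    deterministic the policy is (strategy), bias adds a per-button
    offset independent of evidence (choice bias)."""
    bias = bias or {}
    scores = [kappa * log_belief[b] + bias.get(b, 0.0) for b in available]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]   # shift by max for stability
    r = random.random() * sum(weights)
    for b, w in zip(available, weights):
        r -= w
        if r <= 0:
            return b
    return available[-1]
```

Fitting β, κ, ω (and θ) per agent to observed choices is what places humans and models in the shared cognitive space described above.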
The paper’s contributions are threefold: (1) a novel task that cleanly disentangles sampling from inference, enabling algorithmic‑level analysis of decision‑making; (2) a comprehensive benchmark of state‑of‑the‑art LLMs against humans under identical conditions, revealing that CoT chiefly benefits inference while leaving a persistent gap in active information acquisition; (3) a mechanistic latent‑variable model that quantitatively maps both humans and models into a shared cognitive space, demonstrating how CoT moves LLMs toward human‑like regimes of belief updating but does not fully close the gap in sampling strategies.
In conclusion, chain‑of‑thought reasoning is a powerful tool for aligning LLMs with human‑like inference processes, but the ability to actively gather optimal evidence remains limited. Future work should target the sampling component, perhaps via meta‑learning of exploration policies, dedicated prompting for information‑seeking, or reinforcement‑learning fine‑tuning that explicitly rewards efficient evidence acquisition. Bridging this remaining divide will be essential for deploying LLMs as truly human‑compatible decision‑making agents.