Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions
A central goal of cognitive modeling is to develop models that not only predict human behavior but also provide insight into the underlying cognitive mechanisms. While neural network models trained on large-scale behavioral data often achieve strong predictive performance, they typically fall short in offering interpretable explanations of the cognitive processes they capture. In this work, we explore the potential of pretrained large language models (LLMs) to serve as dual-purpose cognitive models, capable of both accurate prediction and interpretable explanation in natural language. Specifically, we employ reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces that explain human risky choices. Our findings demonstrate that this approach produces high-quality explanations alongside strong quantitative predictions of human decisions.
💡 Research Summary
The paper tackles a long‑standing dilemma in cognitive modeling: building models that not only predict human behavior with high accuracy but also provide interpretable, theory‑level explanations of the underlying cognitive processes. While deep neural networks trained on large behavioral datasets have achieved impressive predictive performance, they typically offer little insight into why a decision is made. The authors propose to leverage the chain‑of‑thought (CoT) capability of pretrained large language models (LLMs) to serve as dual‑purpose cognitive models that generate both predictions and natural‑language reasoning traces for human risky‑choice behavior.
The experimental platform is a dataset of 13,000 risky‑choice problems (choices13k) originally collected by Peterson et al. (2021). Each problem presents two options (A and B) with probabilistic monetary outcomes, described in natural language. The backbone model is Qwen‑2.5‑7B‑Instruct, a 7‑billion‑parameter LLM. To keep fine‑tuning efficient, low‑rank adaptation (LoRA) modules are inserted into every linear layer, resulting in roughly 80 M trainable parameters (≈1 % of the total).
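The ~80 M / ≈1 % figure can be checked with a back‑of‑the‑envelope count. The sketch below uses assumed layer shapes for a Qwen‑2.5‑7B transformer block (grouped‑query attention, 28 layers) and an assumed LoRA rank of 32 chosen to reproduce the reported figure; the authors' exact hyperparameters are not given in the summary.

```python
# Back-of-the-envelope check of the LoRA trainable-parameter count.
# All sizes below are assumptions for Qwen-2.5-7B, not reported values.
HIDDEN, KV, FFN, LAYERS = 3584, 512, 18944, 28

# (d_in, d_out) for every linear layer in one transformer block
block_linears = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV),      # k_proj  (grouped-query attention)
    (HIDDEN, KV),      # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, FFN),     # gate_proj
    (HIDDEN, FFN),     # up_proj
    (FFN, HIDDEN),     # down_proj
]

def lora_params(shapes, rank, n_layers):
    # A LoRA adapter on a (d_in, d_out) linear layer adds two low-rank
    # matrices: A of shape (d_in, rank) and B of shape (rank, d_out),
    # i.e. rank * (d_in + d_out) trainable parameters per layer.
    return n_layers * sum(rank * (d_in + d_out) for d_in, d_out in shapes)

trainable = lora_params(block_linears, rank=32, n_layers=LAYERS)
print(f"{trainable / 1e6:.1f}M trainable ({trainable / 7e9:.1%} of 7B)")
```

Under these assumptions the count lands at roughly 81 M parameters, about 1.2 % of the 7 B total, consistent with the figures quoted above.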
Three post‑training strategies are compared:
- Standard Supervised Fine‑Tuning (SFT) – the model is trained to map the natural‑language description directly to a JSON object containing the empirical choice percentages for options A and B.
- Centaur‑style SFT – a variant introduced by Binz et al. (2025) that masks all tokens except those inside special brackets (“« »”). Only the numeric choice tokens contribute to the loss, encouraging the model to focus on the behavioral output while still using the language model’s knowledge.
- Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO) – the core contribution. For each training example the model generates 12 candidate completions, each consisting of a CoT segment followed by a JSON prediction. An outcome‑based reward is defined as R(q, o) = 1 − |o_B − p_B| if the output is coherent (the predicted probabilities lie in [0, 1] and sum to one), and 0 otherwise, where p_B is the empirical human rate of choosing option B and o_B is the model’s predicted rate.
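The Centaur‑style loss masking described above can be sketched as follows. This is a simplified illustration, not the Binz et al. (2025) implementation: it operates at the character level rather than on tokenizer output, and uses the `-100` ignore label that Hugging Face‑style trainers conventionally exclude from the cross‑entropy loss.

```python
# Sketch of Centaur-style masking: only content between « » markers
# receives a real label; everything else (markers included) gets the
# ignore index and contributes nothing to the loss.
def centaur_labels(text, ignore_index=-100):
    """Return per-character labels: the character itself inside « »,
    ignore_index everywhere else."""
    labels, inside = [], False
    for ch in text:
        if ch == "«":
            inside = True
            labels.append(ignore_index)
        elif ch == "»":
            inside = False
            labels.append(ignore_index)
        else:
            labels.append(ch if inside else ignore_index)
    return labels

example = "Humans chose option «B» with probability «0.7»."
print([l for l in centaur_labels(example) if l != -100])
```

Only the bracketed choice tokens (`B`, `0.7`) survive the mask, so gradient updates are driven entirely by the behavioral output while the surrounding description still conditions the model.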
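The outcome‑based reward above is simple enough to sketch end to end: parse the JSON prediction from a completion, check coherence (both probabilities in [0, 1] and summing to one), and score 1 − |o_B − p_B| against the empirical human choice rate, with incoherent or unparseable outputs scored 0. The JSON field names `"A"` and `"B"` are assumptions for illustration.

```python
import json

def outcome_reward(completion_json: str, p_b: float) -> float:
    """Reward R(q, o) = 1 - |o_B - p_B| for coherent outputs, else 0."""
    try:
        pred = json.loads(completion_json)
        o_a, o_b = float(pred["A"]), float(pred["B"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0  # unparseable completion
    coherent = (0.0 <= o_a <= 1.0 and 0.0 <= o_b <= 1.0
                and abs(o_a + o_b - 1.0) < 1e-6)
    return 1.0 - abs(o_b - p_b) if coherent else 0.0

print(outcome_reward('{"A": 0.4, "B": 0.6}', p_b=0.7))  # coherent prediction
print(outcome_reward('{"A": 0.9, "B": 0.6}', p_b=0.7))  # sums to 1.5 -> reward 0
```

In GRPO, rewards like these computed over the 12 sampled completions for each problem are normalized within the group to form the relative advantages used for the policy update.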