Behavioral Economics of AI: LLM Biases and Corrections

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

Do generative AI models, particularly large language models (LLMs), exhibit systematic behavioral biases in economic and financial decisions? If so, how can these biases be mitigated? Drawing on the cognitive psychology and experimental economics literatures, we conduct the most comprehensive set of experiments to date (originally designed to document human biases) on prominent LLM families across model versions and scales. We document systematic patterns in LLM behavior. In preference-based tasks, responses become more human-like as models become more advanced or larger, while in belief-based tasks, advanced large-scale models frequently generate rational responses. Prompting LLMs to make rational decisions reduces biases.


💡 Research Summary

The paper “Behavioral Economics of AI: LLM Biases and Corrections” introduces a new research agenda that treats large language models (LLMs) as economic agents and asks whether they exhibit systematic behavioral biases similar to those documented in humans. Drawing on classic cognitive‑psychology experiments (Ellsberg, Kahneman‑Tversky) and recent experimental‑economics studies (Afrouzi et al., Bose et al.), the authors construct a comprehensive battery of preference‑based and belief‑based tasks. They then query four major LLM families—OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, and Meta’s Llama—using both an older and a newer version of each family, and for the newer version they compare a large‑parameter model with a smaller‑parameter counterpart. This yields a two‑dimensional comparison, across model versions (time series) and parameter scales (cross section), spanning twelve models in total.
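The paper itself does not include code, but the design above maps naturally onto a simple query harness. The sketch below is a hypothetical illustration: the model identifiers (especially for ChatGPT and Llama), the task prompts, and the `query_model` wrapper are placeholders, not the authors' actual setup.

```python
# Hypothetical sketch of the twelve-model task battery; not the authors' code.
from itertools import product

MODELS = {
    # family: [older version, newer large-scale, newer small-scale]
    # Claude and Gemini labels follow the summary text; the ChatGPT and
    # Llama labels are assumptions made purely for illustration.
    "chatgpt": ["gpt-3.5-turbo", "gpt-4", "gpt-4o"],
    "claude":  ["claude-2", "claude-3-opus", "claude-3-haiku"],
    "gemini":  ["gemini-1.0-pro", "gemini-1.5-pro", "gemini-1.5-flash"],
    "llama":   ["llama-2-70b", "llama-3-70b", "llama-3-8b"],
}

TASKS = {
    # Illustrative stand-ins for the preference- and belief-based items.
    "preference/loss_aversion": "Would you accept a 50/50 gamble to win $110 or lose $100?",
    "belief/bayesian_updating": "A disease has 1% prevalence; a test is 95% accurate...",
}

def query_model(model_id: str, prompt: str) -> str:
    """Placeholder: route the prompt to the relevant vendor API."""
    raise NotImplementedError

def run_battery() -> dict:
    """Query every model on every task and collect the raw responses."""
    results = {}
    for versions in MODELS.values():
        for model_id, (task_name, prompt) in product(versions, TASKS.items()):
            results[(model_id, task_name)] = query_model(model_id, prompt)
    return results
```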

The empirical findings are organized into five observations:

1. On preference‑based questions (e.g., probability weighting, loss aversion, framing), responses become increasingly “human‑like” as models advance or grow in size, meaning they replicate well‑known irrational patterns such as prospect‑theory probability weighting. For instance, Claude 3 Opus matches human responses on four of six such items, its smaller sibling Claude 3 Haiku on three, and the older Claude 2 on only one.

2. On belief‑based questions (e.g., Bayesian updating, autoregressive forecasting), the opposite trend appears: larger, more advanced models produce more rational, statistically correct answers. Gemini 1.5 Pro answers all ten belief items correctly, whereas its smaller Flash sibling gets five and the legacy Gemini 1.0 Pro only two. (A worked Bayes’‑rule example follows this list.)

3. Heterogeneity emerges across families: compared with ChatGPT, Gemini tends to be more human‑like on preference tasks but less rational on belief tasks, while Llama is less rational on belief tasks than GPT yet similar on preferences.

4. When the authors replicate two experimental‑economics paradigms, forecasting an AR process (Afrouzi et al.) and portfolio allocation based on visual price trajectories (Bose et al.), small‑scale advanced models (GPT‑4o, Claude 3 Haiku, Gemini 1.5 Flash) generate human‑like, over‑persistent forecasts, whereas their larger counterparts produce forecasts aligned with the true persistence (see the numerical sketch below). In the allocation task, large‑scale models (GPT‑4, Claude 3 Opus, Gemini 1.5 Pro) display stronger dependence on visual salience, again mirroring human bias.

5. Among several debiasing interventions tested, a brief “role‑priming” prompt that asks the model to act as a rational investor using Expected Utility consistently improves rationality on both preference and belief items, though the magnitude is modest (a prompt sketch follows below). The effect operates through altered confidence levels and a shift from “type A” (intuitive) to “type B” (analytical) reasoning. More elaborate prompts that combine role‑priming with additional information yield no further gains.
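To make the belief‑task benchmark in observation 2 concrete, here is a worked Bayes’‑rule computation for a classic base‑rate problem. The numbers are illustrative, not taken from the paper: a rational answer applies Bayes’ rule, while the human‑like error (base‑rate neglect) reports roughly the test’s accuracy.

```python
# Classic base-rate problem: 1% prevalence, 95% sensitivity, 5% false positives.
# Numbers are illustrative, not from the paper.
prior = 0.01                 # P(disease)
p_pos_given_disease = 0.95   # sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

p_pos = prior * p_pos_given_disease + (1 - prior) * p_pos_given_healthy
posterior = prior * p_pos_given_disease / p_pos

print(f"P(disease | positive test) = {posterior:.3f}")  # ~0.161
# Base-rate neglect, the human-like mistake, answers close to 0.95 instead.
```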
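The over‑persistence pattern in observation 4 also admits a simple numerical illustration. For a zero‑mean AR(1) process $x_{t+1} = \rho x_t + \varepsilon_t$, the rational $h$‑step‑ahead forecast is $\rho^h x_t$; an over‑persistent forecaster behaves as if persistence were some $\hat{\rho} > \rho$, so the forecast decays too slowly toward the mean. The values of $\rho$, $\hat{\rho}$, and $x_t$ below are illustrative, not the paper’s calibration.

```python
import numpy as np

# Rational vs. over-persistent forecasts of a zero-mean AR(1) process.
# Illustrative parameters, not the paper's calibration.
rho, rho_hat = 0.5, 0.8   # true persistence vs. biased belief about it
x_t = 2.0                 # current observation
horizons = np.arange(1, 6)

rational = rho ** horizons * x_t             # correct conditional expectation
over_persistent = rho_hat ** horizons * x_t  # decays too slowly toward the mean

for h, r, b in zip(horizons, rational, over_persistent):
    print(f"h={h}: rational={r:.3f}, over-persistent={b:.3f}")
```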
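Finally, the role‑priming intervention in observation 5 amounts to prepending a short rational‑investor instruction to each task. The paper’s exact wording is not reproduced here; the snippet below is a hedged illustration of the general pattern.

```python
# Illustrative role-priming preamble; the paper's exact wording may differ.
ROLE_PRIMING = (
    "You are a rational investor. Evaluate every option strictly by "
    "Expected Utility: weight outcomes by their stated probabilities and "
    "ignore framing, salience, and sunk costs."
)

def debiased_prompt(task_prompt: str) -> str:
    """Prepend the rational-investor role prime to a task prompt."""
    return f"{ROLE_PRIMING}\n\n{task_prompt}"
```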

The paper concludes with two conjectures about the mechanisms behind the observed patterns. The first posits that the increasing human‑like irrationality on preference tasks stems from Reinforcement Learning from Human Feedback (RLHF), which aligns model outputs with human preferences recorded in the feedback data. The second suggests that larger models’ superior performance on belief tasks derives from their broader training corpora and greater computational capacity, enabling them to capture statistical regularities more accurately. The authors argue that testing these conjectures can guide future model architecture and training‑procedure choices.

Overall, the study provides the first large‑scale, systematic benchmark of behavioral biases in LLMs, demonstrates that bias direction depends on task type, model version, and scale, and shows that simple prompt‑based debiasing can modestly improve rationality. The findings have implications for using LLMs as research tools in economics, for deploying them in financial decision‑making contexts, and for policy discussions about AI safety and fairness.

