Eliciting Trustworthiness Priors of Large Language Models via Economic Games
One critical aspect of building human-centered, trustworthy artificial intelligence (AI) systems is maintaining calibrated trust: appropriate reliance on AI systems outperforms both overtrust (e.g., automation bias) and undertrust (e.g., disuse). A fundamental challenge, however, is how to characterize the level of trust exhibited by an AI system itself. Here, we propose a novel elicitation method based on iterated in-context learning (Zhu and Griffiths, 2024a) and apply it to elicit trustworthiness priors using the Trust Game from behavioral game theory. The Trust Game is particularly well suited for this purpose because it operationalizes trust as voluntary exposure to risk based on beliefs about another agent, rather than self-reported attitudes. Using our method, we elicit trustworthiness priors from several leading large language models (LLMs) and find that GPT-4.1’s trustworthiness priors closely track those observed in humans. Building on this result, we further examine how GPT-4.1 responds to different player personas in the Trust Game, providing an initial characterization of how such models differentiate trust across agent characteristics. Finally, we show that variation in elicited trustworthiness can be well predicted by a stereotype-based model grounded in perceived warmth and competence.
💡 Research Summary
This paper addresses a fundamental challenge in human-AI interaction: how to characterize the level of trust exhibited by an AI system itself. The authors argue that for safe and effective collaboration, trust in AI must be calibrated—avoiding both overtrust and undertrust. To this end, they introduce a novel methodological framework for eliciting the implicit “trustworthiness priors” of Large Language Models (LLMs), which represent their baseline expectations about reciprocity in social exchange.
The core innovation lies in the fusion of two established concepts. First, the Trust Game from behavioral economics is used as a formal, behavioral measure of trust, defined as voluntary exposure to risk based on beliefs about another agent. Second, the authors adapt the “iterated in-context learning” paradigm to this game. Within a Bayesian framework, the Trustee’s return behavior is modeled using a Beta-Binomial distribution, where the key parameter is the return ratio (r). The iterative process works as follows: an LLM is shown a small batch of five past Trust Game interactions. It uses this context to predict the return ratio for a new interaction. This predicted value is then used to stochastically generate a new batch of five game outcomes, which become the context for the next iteration. Crucially, this information bottleneck causes the influence of the initial data to decay, allowing the chain of predictions to converge to the model’s inherent prior distribution over trustworthiness, independent of the starting point.
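The iterated procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `llm_predict_return_ratio` is a hypothetical placeholder for the actual LLM call, and the batch format (pairs of amount sent and amount returned, with sent amounts tripled as in the standard Trust Game) is an assumption for concreteness.

```python
import random

def llm_predict_return_ratio(batch):
    """Hypothetical stand-in for the LLM call: given five past
    interactions, return a predicted return ratio in [0, 1].
    A real run would prompt the model with the batch instead."""
    # Placeholder behavior: average the observed ratios with small noise.
    mean = sum(returned / (sent * 3) for sent, returned in batch) / len(batch)
    return min(1.0, max(0.0, mean + random.gauss(0, 0.05)))

def simulate_batch(r, n=5, sent=10):
    """Stochastically generate n new game outcomes from the predicted
    ratio r: each of the tripled tokens is returned with probability r
    (a Binomial likelihood, matching the Beta-Binomial setup)."""
    tripled = sent * 3
    return [(sent, sum(random.random() < r for _ in range(tripled)))
            for _ in range(n)]

def iterated_chain(initial_batch, iterations=50):
    """Run the iterated in-context learning chain. Because only five
    outcomes pass between steps, this information bottleneck washes out
    the initial data and predictions drift toward the model's prior."""
    batch, ratios = initial_batch, []
    for _ in range(iterations):
        r = llm_predict_return_ratio(batch)
        ratios.append(r)
        batch = simulate_batch(r)
    return ratios
```

Collecting the predicted ratios from the later iterations of many such chains yields samples from the model's prior distribution over trustworthiness, independent of the starting batch.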
In Experiment 1, this method was applied to 20 state-of-the-art LLMs, including proprietary models (GPT-4.1, GPT-5.2, Claude-3.5, Gemini-2.5) and open-weight models (Llama, Qwen series). The elicited trustworthiness distributions were compared to a human baseline derived from a meta-analysis of human Trust Game studies. The results revealed significant variation across models. GPT-4.1’s prior distribution most closely matched the human distribution, as measured by the lowest Kullback-Leibler divergence (0.130). Other models displayed different biases, such as a strong preference for a 50/50 split or expectations of extremely high returns.
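The comparison metric above is the discrete Kullback-Leibler divergence between an elicited distribution and the human baseline, computed over binned return ratios. A minimal sketch (the bin counts below are illustrative, not the paper's data):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions given as lists of
    probabilities over the same bins; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

# Illustrative histograms over four return-ratio bins (hypothetical values).
human_prior = [0.10, 0.35, 0.40, 0.15]
model_prior = [0.05, 0.30, 0.45, 0.20]

divergence = kl_divergence(model_prior, human_prior)
```

A lower divergence indicates a closer match to the human distribution; in the paper, GPT-4.1 attains the lowest value (0.130) among the 20 models tested.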
Building on the finding that GPT-4.1’s priors are most human-like, Experiment 2 investigated how these priors shift in response to different Player A (Trustor) personas. The researchers created 100 personas varying in profession, gender, and moral character. For each persona, they elicited GPT-4.1’s expected return ratio. The model showed systematic discrimination, expecting high returns from personas like “nurse” or “teacher” and very low returns from “criminal” or “scammer.” To explain this variance, the authors drew upon the Stereotype Content Model from social psychology. They had GPT-4.1 itself rate each persona on the dimensions of “warmth” and “competence.” A linear regression model using these two scores as predictors was able to explain 85% of the variance in the elicited trustworthiness, with “warmth” being a stronger predictor than “competence.” This demonstrates that the LLM’s social judgments are structured in a way remarkably similar to human stereotype-based reasoning.
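The regression analysis above can be reproduced in miniature with ordinary least squares. The warmth/competence ratings and return ratios below are invented for illustration (the paper uses 100 personas rated by GPT-4.1 itself); only the model structure, elicited trustworthiness as a linear function of warmth and competence, follows the text.

```python
import numpy as np

# Hypothetical illustrative data: columns are warmth and competence
# ratings per persona; y is the elicited expected return ratio.
X = np.array([[0.90, 0.80],   # e.g., "nurse"
              [0.85, 0.90],   # e.g., "teacher"
              [0.20, 0.40],   # e.g., "scammer"
              [0.15, 0.30],   # e.g., "criminal"
              [0.60, 0.70],
              [0.50, 0.50]])
y = np.array([0.55, 0.52, 0.10, 0.08, 0.38, 0.33])

# Fit y = b0 + b1 * warmth + b2 * competence by least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Coefficient of determination (R^2) for the fit.
pred = A @ coef
ss_res = ((y - pred) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
```

In the paper's analysis, the analogous regression explains 85% of the variance, with the warmth coefficient larger than the competence coefficient.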
The study makes several key contributions. Methodologically, it provides a robust, game-theoretic tool for quantifying the social priors of AI systems, moving beyond self-report measures. Empirically, it offers a comprehensive benchmark of trustworthiness biases across a wide range of modern LLMs, identifying GPT-4.1 as particularly aligned with human norms. Theoretically, it shows that advanced LLMs not only learn statistical patterns of human behavior but also internalize structured social cognitive frameworks, such as stereotype dimensions, which guide their context-dependent trust judgments. These findings have implications for using LLMs as simulators of human social behavior and highlight the critical need to audit and understand the social biases embedded within these models to ensure their trustworthy deployment.