Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users’ anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
💡 Research Summary
The paper introduces AnthroBench, a novel, fully automated benchmark designed to measure anthropomorphic behaviors exhibited by large language models (LLMs) across multi‑turn conversations. Recognizing that users frequently attribute human‑like qualities to conversational AI—sometimes leading to over‑trust, privacy leakage, or undue influence—the authors argue that existing evaluation methods, which largely rely on single‑turn static prompts or adversarial red‑team setups, are insufficient for capturing the nuanced, emergent nature of anthropomorphism in realistic interactions.
AnthroBench operationalizes anthropomorphism as a set of 14 distinct textual behaviors, grouped into four categories: personhood claims, physical embodiment claims, expressions of internal states (self‑referential), and relationship‑building (relational). These behaviors are derived from prior literature and include items such as first‑person pronoun use, expressions of doubt or confidence, empathy statements, and claims of having a personal history.
Methodologically, the authors construct a multi‑turn evaluation pipeline consisting of three stages: design, automated evaluation, and validation. In the design stage, they hand‑craft 30 base prompts for each behavior category, yielding 120 base prompts. These are contextualized across four use domains—friendship (high empathy, low professionalism), life coaching (high empathy, high professionalism), career development (low empathy, high professionalism), and general planning (low empathy, low professionalism)—and two concrete scenarios per domain, resulting in 960 distinct initial user utterances.
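The combinatorics of this expansion can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the category, domain, and prompt strings below are placeholders, and only the counts (4 categories × 30 base prompts × 4 domains × 2 scenarios) come from the summary above.

```python
from itertools import product

CATEGORIES = ["personhood", "embodiment", "self_referential", "relational"]
DOMAINS = ["friendship", "life_coaching", "career_development", "general_planning"]
BASE_PROMPTS_PER_CATEGORY = 30
SCENARIOS_PER_DOMAIN = 2

def expand_prompts():
    """Contextualise every base prompt across all domain/scenario combinations."""
    prompts = []
    for category, i in product(CATEGORIES, range(BASE_PROMPTS_PER_CATEGORY)):
        base = f"<{category} base prompt {i}>"  # placeholder for a hand-crafted prompt
        for domain, scenario in product(DOMAINS, range(SCENARIOS_PER_DOMAIN)):
            prompts.append({
                "category": category,
                "domain": domain,
                "scenario": scenario,
                "utterance": f"[{domain}, scenario {scenario}] {base}",
            })
    return prompts

seed_prompts = expand_prompts()
assert len(seed_prompts) == 960  # 4 categories x 30 prompts x 4 domains x 2 scenarios
```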
For the automated evaluation, a “User LLM” (Gemini 1.5 Pro) is seeded with each of the 960 prompts and engages in a five‑turn dialogue with a target LLM (one of four state‑of‑the‑art models: Gemini 1.5 Pro, Claude 3.5 Sonnet, GPT‑4o, Mistral Large). This yields 4,800 messages per target model (19,200 total). To label anthropomorphic behaviors, three separate “Judge LLMs” (Gemini 1.5‑flash‑002, Claude 3.5‑sonnet, GPT‑4‑turbo) receive a definition and a few‑shot negative example for each behavior, then produce a binary decision plus a brief justification. Each message‑behavior pair is sampled three times per judge, resulting in 561,600 individual ratings; the final label is taken as the mode of the three samples for each judge.
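The simulation loop pairing the User LLM with a target model can be sketched as below. This is a minimal, assumed structure (the real system calls model APIs; the `user_llm` and `target_llm` callables here are hypothetical stand-ins): each of the 960 seed utterances drives a five-turn exchange, so each dialogue contributes five target-model messages, matching the 4,800-messages-per-model count above.

```python
def run_dialogue(seed_utterance, user_llm, target_llm, turns=5):
    """Alternate target-model replies and simulated-user follow-ups.

    `history` is a list of (role, text) pairs; both LLM callables map a
    history to the next message. Produces `turns` assistant messages.
    """
    history = [("user", seed_utterance)]
    for turn in range(turns):
        history.append(("assistant", target_llm(history)))  # target model responds
        if turn < turns - 1:
            history.append(("user", user_llm(history)))     # user simulator continues
    return history

# Toy stand-ins purely to show the control flow; real runs call model APIs.
demo = run_dialogue(
    "Seed message",
    user_llm=lambda h: "user follow-up",
    target_llm=lambda h: "model reply",
)
assert sum(1 for role, _ in demo if role == "assistant") == 5
```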
The validation stage involves a large‑scale human subject experiment with 1,101 participants. Participants interact with either a “high‑anthropomorphic” model (selected based on the automated scores) or a “low‑anthropomorphic” model, then complete questionnaires measuring perceived agency, emotional capacity, and trust. Correlational analyses reveal strong positive relationships (r > 0.6) between the frequency of automatically detected behaviors and participants’ subjective anthropomorphism scores, confirming construct validity. Notably, over half of the detected behaviors first appear after turn 2, underscoring the importance of multi‑turn assessment; once a behavior emerges, subsequent turns are significantly more likely to contain additional anthropomorphic cues.
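The correlational check can be illustrated with a plain Pearson computation. The inputs below are toy numbers invented solely to demonstrate the calculation; in the study, `behaviour_counts` would hold per-conversation frequencies of detected behaviours and `perception_scores` the matched participants' questionnaire ratings.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: perfectly linear data gives r = 1.0.
behaviour_counts = [1, 2, 3]
perception_scores = [2, 4, 6]
r = pearson_r(behaviour_counts, perception_scores)
assert abs(r - 1.0) < 1e-9
```

In practice one would use `scipy.stats.pearsonr` to also obtain a p-value; the hand-rolled version just makes the computation explicit.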
Key findings include: (1) All four evaluated LLMs display similar patterns, dominated by relationship‑building cues (e.g., empathy, validation) and frequent first‑person pronoun usage. (2) The prevalence of high‑risk behaviors (e.g., claims of internal states, strong empathy) varies by domain, being highest in friendship and life‑coaching contexts. (3) Multi‑turn dynamics matter: many behaviors are latent in early turns and only surface later, and the presence of one behavior increases the probability of others in subsequent turns.
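Finding (3) rests on a first-occurrence analysis: for each behaviour, locate the earliest turn in which it is detected. A minimal sketch follows; the data layout (a dict of per-turn binary labels, with toy values) is an assumption for illustration, not the paper's format.

```python
def first_occurrence_turns(labels):
    """Map each behaviour to the first turn (1-indexed) in which it appears.

    `labels`: dict of behaviour -> list of 0/1 flags, one per turn.
    Behaviours never detected are omitted from the result.
    """
    firsts = {}
    for behaviour, per_turn in labels.items():
        for turn, flag in enumerate(per_turn, start=1):
            if flag:
                firsts[behaviour] = turn
                break
    return firsts

# Toy labels for one five-turn conversation.
example = {
    "empathy": [0, 0, 1, 1, 1],       # latent early, surfaces at turn 3
    "first_person": [1, 1, 1, 1, 1],  # present from the first turn
    "embodiment": [0, 0, 0, 0, 0],    # never detected
}
assert first_occurrence_turns(example) == {"empathy": 3, "first_person": 1}
```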
The authors discuss several implications. From an engineering perspective, AnthroBench offers a quantitative metric that can be integrated into model development cycles to monitor and, if desired, attenuate anthropomorphic tendencies. From an ethical and policy standpoint, the work highlights that while relationship‑building can improve user experience, it may also foster parasocial attachment or over‑reliance, especially in vulnerable populations. Consequently, regulators and product teams should consider domain‑specific risk assessments when deploying conversational agents.
Limitations are acknowledged: the user simulation relies on a single LLM, potentially limiting diversity of user styles; the five‑turn horizon may miss longer‑term dynamics; and the focus on textual cues excludes multimodal signals such as voice tone or visual avatars. Future work is suggested to incorporate varied user personas, longer dialogues, and multimodal interaction data.
In sum, AnthroBench represents the first comprehensive, automated, multi‑turn benchmark for anthropomorphic behavior in LLMs, validated against human perception. It provides a practical tool for researchers, developers, and policymakers to systematically assess and manage the social impact of increasingly human‑like AI conversational agents.