Can LLMs Truly Embody Human Personality? Analyzing AI and Human Behavior Alignment in Dispute Resolution


Large language models (LLMs) are increasingly used to simulate human behavior in social settings such as legal mediation, negotiation, and dispute resolution. However, it remains unclear whether these simulations reproduce the personality-behavior patterns observed in humans. Human personality shapes how individuals navigate social interactions, including their strategic choices in emotionally charged exchanges. This raises the question: can LLMs, when prompted with personality traits, reproduce the personality-driven differences seen in human conflict behavior? To explore this, we introduce an evaluation framework that enables direct comparison of human-human and LLM-LLM behavior in dispute resolution dialogues with respect to Big Five Inventory (BFI) personality traits. The framework provides a set of interpretable metrics covering strategic behavior and conflict outcomes. We additionally contribute a novel methodology for creating LLM dispute resolution dialogues whose scenarios and personality traits are matched to human conversations. Finally, we demonstrate the framework on three contemporary closed-source LLMs and show significant divergences between how personality manifests in conflict across the LLMs and in human data, challenging the assumption that personality-prompted agents can serve as reliable behavioral proxies in socially impactful applications. Our work highlights the need for psychological grounding and validation of AI simulations before real-world use.


💡 Research Summary

The paper investigates whether large language models (LLMs) can faithfully reproduce human personality‑driven behavior in dispute‑resolution dialogues. Recognizing the growing deployment of LLMs in high‑stakes interpersonal contexts such as mediation, negotiation, and legal coaching, the authors ask whether prompting an LLM with a Big Five Inventory (BFI) profile yields behavior that aligns with the same personality traits in real humans.

To answer this, they construct an evaluation framework that compares human-human (H-H) and LLM-LLM (L-L) interactions on identical negotiation scenarios with matched personality profiles. The human data come from the KODIS corpus, a crowd-sourced set of 248 extended buyer-seller dispute dialogues, each accompanied by complete BFI responses for both participants. For the L-L side, they generate a parallel dataset (L2L) by prompting three contemporary closed-source LLMs (OpenAI GPT-4o mini, Anthropic Claude 3.7 Sonnet, and Google Gemini Flash) with the same scenario text and a BFI-derived personality prompt. Personality prompts are built from a pool of 70 bipolar adjectives (three per trait), calibrated to high, medium, or low intensity, following the validated method of Huang & Hadfi (2024). All models are run at the default temperature of 1 to test zero-shot capability.
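For illustration, here is a minimal sketch of how such a BFI-derived personality prompt might be assembled from bipolar adjectives and intensity levels. The adjective pool, phrasing, and function names below are hypothetical placeholders, not the authors' released implementation:

```python
# Hypothetical sketch of BFI-derived personality prompting, loosely following
# the bipolar-adjective approach the paper attributes to Huang & Hadfi (2024).
# The adjective pairs and prompt wording here are illustrative only.
BIPOLAR_ADJECTIVES = {
    "extraversion":      [("reserved", "outgoing"), ("quiet", "talkative"), ("shy", "bold")],
    "agreeableness":     [("critical", "sympathetic"), ("cold", "warm"), ("harsh", "kind")],
    "conscientiousness": [("careless", "organized"), ("lazy", "diligent"), ("sloppy", "thorough")],
    "neuroticism":       [("calm", "anxious"), ("secure", "insecure"), ("stable", "moody")],
    "openness":          [("conventional", "imaginative"), ("incurious", "curious"), ("routine-bound", "creative")],
}

# Map a trait's intensity level onto the low/high pole of each adjective pair.
INTENSITY = {"low": "very {0}", "medium": "neither {0} nor {1}", "high": "very {1}"}

def build_personality_prompt(trait_levels: dict[str, str]) -> str:
    """Render a natural-language personality description from per-trait levels."""
    phrases = []
    for trait, level in trait_levels.items():
        for low_adj, high_adj in BIPOLAR_ADJECTIVES[trait]:
            phrases.append(INTENSITY[level].format(low_adj, high_adj))
    return "You are a person who is " + ", ".join(phrases) + "."

print(build_personality_prompt({
    "extraversion": "high", "agreeableness": "low", "conscientiousness": "medium",
    "neuroticism": "high", "openness": "medium",
}))
```

The rendered description would then be prepended to the scenario text as the system prompt for each simulated disputant.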

Behavioral measurement is split into two categories. (1) Final outcomes: (a) negotiation score (the inner product of agreed allocations and issue preferences), (b) acceptance of the final offer, and (c) “not walk‑away” (whether the participant stayed in the negotiation). (2) Strategic behaviors: Using the Interests‑Rights‑Power (IRP) framework, each utterance is labeled as cooperative, competitive, neutral, rights‑based, power‑based, or residual. From these labels the authors compute four quantitative metrics: IRP Ratio (frequency of competitive vs. cooperative moves), IRP Reciprocity (how often a speaker mirrors the partner’s preceding strategy), Escalation Ratio (competitive responses to non‑competitive moves), and De‑escalation Ratio (non‑competitive responses to competitive moves). These metrics capture fine‑grained dynamics such as mirroring, tension escalation, and conflict resolution tactics.
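To make these definitions concrete, the following is a minimal Python sketch of the negotiation score and the four IRP metrics, assuming an alternating dialogue with one IRP label per utterance. The exact label set, tie-handling, and denominators are assumptions; the paper defines the metrics formally:

```python
import numpy as np

def negotiation_score(allocation, preferences):
    """Final outcome: inner product of agreed issue allocations and preferences."""
    return float(np.dot(allocation, preferences))

def irp_metrics(turns):
    """Compute per-speaker strategy metrics from an ordered list of
    (speaker, label) pairs, with labels such as 'cooperative' or 'competitive'."""
    metrics = {}
    for speaker in {s for s, _ in turns}:
        own = [(i, lbl) for i, (s, lbl) in enumerate(turns) if s == speaker]
        n_comp = sum(lbl == "competitive" for _, lbl in own)
        n_coop = sum(lbl == "cooperative" for _, lbl in own)
        mirrored = escalated = deescalated = 0
        n_after_noncomp = n_after_comp = n_with_prev = 0
        for i, lbl in own:
            if i == 0:
                continue
            prev = turns[i - 1][1]          # partner's preceding strategy
            n_with_prev += 1
            mirrored += lbl == prev          # reciprocity: mirroring the partner
            if prev == "competitive":
                n_after_comp += 1
                deescalated += lbl != "competitive"   # de-escalation
            else:
                n_after_noncomp += 1
                escalated += lbl == "competitive"      # escalation
        metrics[speaker] = {
            "irp_ratio": n_comp / max(n_coop, 1),               # competitive vs. cooperative
            "reciprocity": mirrored / max(n_with_prev, 1),
            "escalation": escalated / max(n_after_noncomp, 1),
            "de_escalation": deescalated / max(n_after_comp, 1),
        }
    return metrics

turns = [("buyer", "competitive"), ("seller", "cooperative"),
         ("buyer", "competitive"), ("seller", "competitive"),
         ("buyer", "cooperative")]
print(negotiation_score([1, 0, 2], [0.5, 0.3, 0.2]))  # 0.9
print(irp_metrics(turns))
```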

Statistical analysis reveals stark divergences between humans and LLMs. In the human sample, neuroticism emerges as the strongest predictor of lower scores, higher escalation, and greater likelihood of walking away (β ≈ ‑0.38, p < .01), while extraversion and agreeableness modestly increase cooperative moves and acceptance rates. The LLMs show a markedly different pattern: extraversion and agreeableness dominate the predictive landscape, while neuroticism has a negligible effect. Claude and Gemini align more closely with human patterns (IRP Reciprocity ≈ 0.42 and 0.39, respectively), whereas GPT‑4o mini displays the weakest alignment (≈ 0.21) and the highest escalation ratio (≈ 0.31). Regression and structural equation modeling confirm that for humans the pathway “neuroticism → escalation → low acceptance” is significant, while for LLMs the dominant pathway is “extraversion → cooperative proposals → high acceptance.”
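As an illustration of the kind of trait-behavior regression described above, here is a minimal sketch using statsmodels. The variable names, toy data, and model form are placeholders, not the paper's exact specification (which also includes structural equation modeling):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical illustration of a trait -> behavior regression; the paper's
# actual covariates, standardization, and SEM structure may differ.
df = pd.DataFrame({
    "neuroticism":   [0.2, 0.8, 0.5, 0.9, 0.1, 0.4],
    "extraversion":  [0.7, 0.3, 0.6, 0.2, 0.9, 0.5],
    "agreeableness": [0.6, 0.4, 0.8, 0.3, 0.7, 0.5],
    "escalation":    [0.1, 0.5, 0.2, 0.6, 0.0, 0.3],
})

# OLS of a behavioral metric on the BFI traits yields the kind of beta
# coefficients reported above (e.g., neuroticism -> escalation in humans).
model = smf.ols("escalation ~ neuroticism + extraversion + agreeableness", data=df).fit()
print(model.params)   # trait coefficients (betas)
print(model.pvalues)  # significance levels
```

Fitting the same model separately to the human (KODIS) and LLM (L2L) samples is what exposes the divergent trait-to-behavior pathways the paper reports.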

These findings challenge the common assumption that personality‑prompted LLMs can serve as psychologically accurate proxies in socially impactful applications. The authors argue that current prompting techniques do not endow LLMs with the nuanced emotional instability and conflict‑avoidance mechanisms that characterize human neuroticism, and that model architecture and pre‑training data biases likely shape how personality traits are expressed.

The paper contributes (1) a publicly released evaluation framework, (2) the L2L dataset with matched personality profiles, and (3) a set of interpretable metrics for future work. It calls for systematic psychological grounding, behavioral validation, and ethical oversight before deploying LLMs in mediation, negotiation, or other high‑risk interpersonal domains. The work thus provides a concrete methodological foundation for assessing human‑aligned, socially responsible AI.

