HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HumanLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from ~12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.91) while revealing that holistic metrics conflate simulation accuracy with social desirability. HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4x fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling: simulating not just what humans do, but the psychological processes generating those behaviors.
💡 Research Summary
HumanLLM introduces a novel framework for advancing the anthropomorphism of large language models (LLMs) by grounding them in human cognitive and social‑cognitive patterns rather than treating personality traits as isolated labels. The authors first compile a taxonomy of 244 psychological patterns—100 personality traits derived from the Big Five (Goldberg, 1992) and 144 situationally triggered social‑cognitive mechanisms (e.g., spotlight effect, conformity, self‑serving bias). Each pattern is grounded in a systematic literature review of roughly 50 peer‑reviewed papers, yielding a total corpus of about 12,000 articles. Using Gemini 2.5 Pro, the team extracts a structured representation for every pattern consisting of a precise definition, core mechanisms, and real‑world manifestations, thereby ensuring construct validity.
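The per-pattern record could be sketched as a simple data structure; the field names below are illustrative stand-ins, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the structured pattern representation described
# above (definition, core mechanisms, real-world manifestations).
@dataclass
class PsychPattern:
    name: str
    category: str  # e.g. "personality_trait" or "social_cognitive"
    definition: str  # precise construct definition
    core_mechanisms: list[str] = field(default_factory=list)
    manifestations: list[str] = field(default_factory=list)

# Example instance for one of the social-cognitive mechanisms mentioned above.
spotlight = PsychPattern(
    name="spotlight effect",
    category="social_cognitive",
    definition="Overestimating how much others notice one's appearance or behavior.",
    core_mechanisms=["egocentric anchoring", "insufficient adjustment"],
    manifestations=["avoiding participation after a minor public mistake"],
)
print(spotlight.category)  # social_cognitive
```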
Next, the authors generate 11,359 multi‑character scenarios. Each scenario contains 2–6 characters and embeds 2–5 patterns that may reinforce, conflict, or conditionally modulate each other. Scenario creation leverages the DIAMONDS model to vary situational dimensions (e.g., goals, interpersonal dynamics, cultural context) and ensures that pattern combinations are semantically coherent. Character profiles encode both self‑perception (identity, traits, motivations) and other‑perception (beliefs about fellow characters), enabling realistic information asymmetry.
Multi‑turn dialogues (12–20 turns) are then synthesized with Claude Sonnet 4.5. Every turn is tripartite: inner thoughts (in brackets), physical actions (in parentheses), and spoken utterances. This design forces the model to express the target patterns across cognitive, affective, and behavioral layers, mimicking how humans translate internal states into observable behavior.
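Assuming the bracket/parenthesis convention described above, a turn can be split into its three channels with a small parser; the exact serialization in the dataset may differ.

```python
import re

# Illustrative parser for the tripartite turn format: [inner thoughts],
# (physical actions), and whatever remains is the spoken utterance.
def parse_turn(turn: str) -> dict:
    thoughts = re.findall(r"\[([^\]]*)\]", turn)   # text inside square brackets
    actions = re.findall(r"\(([^)]*)\)", turn)     # text inside parentheses
    # Strip both annotation types; the residue is speech.
    speech = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", turn)
    return {
        "thoughts": thoughts,
        "actions": actions,
        "speech": " ".join(speech.split()),  # collapse leftover whitespace
    }

turn = "[They're all staring at me.] (glances at the floor) I guess I could present."
parsed = parse_turn(turn)
print(parsed["speech"])  # I guess I could present.
```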
Evaluation is performed with a dual‑level checklist system. At the pattern level, 15 behavioral indicators per pattern assess fidelity to the underlying psychological construct. At the scenario level, 2–6 items per character capture expected behaviors given the specific pattern combination and situational context. Human judges and an LLM‑based judge score model outputs against these checklists. The resulting human‑alignment correlation reaches r = 0.91, indicating that the checklist scores closely mirror human judgments. Importantly, the authors demonstrate that holistic metrics (e.g., overall BLEU or accuracy) conflate simulation fidelity with social desirability, whereas the dual‑level approach isolates genuine psychological alignment.
Performance experiments compare HumanLLM‑8B (8 billion parameters) with Qwen3‑32B (32 billion parameters). Despite having a quarter of the parameters, HumanLLM‑8B outperforms Qwen3‑32B on multi‑pattern dynamics, especially in conflict scenarios where it correctly balances reinforcement (e.g., overconfidence + self‑serving bias) against suppression (e.g., talkativeness inhibited by spotlight effect). This suggests that explicit exposure to interacting patterns during fine‑tuning yields more human‑like reasoning than sheer model scale.
The paper acknowledges limitations: reliance on LLM‑generated scenarios may inherit model biases; the pattern set, while extensive, still caps interactions at five patterns per scenario; and human evaluation remains costly. Future directions include expanding to richer pattern networks, real‑time interactive testing, and reinforcement‑learning‑based reward modeling that directly optimizes for psychological alignment.
In sum, HumanLLM provides a comprehensive, psychologically grounded dataset and evaluation methodology that pushes LLMs toward authentic anthropomorphism. By treating cognitive patterns as causal forces and training models on their dynamic interplay, the work demonstrates that simulating “why” humans behave as they do is essential for building truly human‑like conversational agents.