One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This paper introduces OMAR (One Model, All Roles), a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimization, OMAR lets a single model role-play all participants in a conversation simultaneously, learning to achieve long-term goals and navigate complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement hierarchical advantage estimation, which computes both turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise-seeking, demonstrating that collaboration can be learned even in competitive scenarios. While we identify practical challenges such as reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work motivates further research on AI social intelligence in group conversations.


💡 Research Summary

The paper introduces OMAR (One Model, All Roles), a reinforcement‑learning framework that enables a single large language model (LLM) to play every participant in a multi‑turn, multi‑agent conversation simultaneously. Traditional RLVR approaches such as PPO or GRPO are designed for single‑turn, single‑agent tasks and therefore struggle with the open‑ended, high‑dimensional action space of dialogue. OMAR reinterprets the n independent rollouts of GRPO as the n agents in a conversation. At each turn the model receives a concatenated prompt consisting of the shared conversation history and a persona‑specific description for each role, and it generates n utterances in parallel. The conversation history for the next turn is formed by aggregating all utterances, allowing the model to self‑play against copies of itself in a closed loop.
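The rollout loop described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `generate`, the prompt layout, and the bracketed speaker tags are all assumptions.

```python
# Sketch of the OMAR self-play rollout: one shared model produces every
# role's utterance each turn, and all utterances are merged back into the
# shared conversation history (names here are illustrative).
def self_play_episode(generate, personas, max_turns):
    """`generate(prompt)` stands in for a single shared LLM call;
    `personas` is a list of role descriptions, one per agent."""
    history = []
    for _ in range(max_turns):
        # Each role sees the same shared history plus its own persona --
        # the n rollouts of GRPO reinterpreted as n agents in one dialogue.
        utterances = [
            generate(persona + "\n" + "\n".join(history))
            for persona in personas
        ]
        # Aggregate all utterances to form the next turn's shared history.
        for persona, utt in zip(personas, utterances):
            history.append(f"[{persona}]: {utt}")
    return history
```

In an actual training run the `generate` calls for a turn would be batched into a single parallel forward pass, since every role is played by the same model.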

Training uses an episode‑level scalar reward that reflects social outcomes (goal completion, relationship quality, rule compliance, financial benefit, etc.). Because the reward is only observed at the end of a dialogue, naïve PPO would back‑propagate this signal through a very long token sequence, leading to high‑variance advantage estimates. To address this, the authors propose Hierarchical Advantage Estimation. First, each turn is treated as a macro‑step; the value of the last token in the turn is used with Generalized Advantage Estimation (GAE) to compute a turn‑level advantage. Second, this turn‑level advantage is treated as a pseudo‑reward for that turn, and standard token‑level value functions plus GAE are applied within the turn to obtain token‑level advantages. These token‑level advantages are finally used in the PPO loss, dramatically reducing variance while preserving long‑horizon credit assignment.
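The two-level scheme can be sketched as below, under stated assumptions: the episode reward is credited only at the final turn, each turn-level advantage is credited at its turn's last token, and the bootstrap value after the final step is zero. The paper's exact discounting constants and value-network details are not specified here.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one sequence of steps.
    `values` carries one extra bootstrap entry (value after the last step)."""
    advantages = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

def hierarchical_advantages(episode_reward, turn_values, token_values_per_turn,
                            gamma=0.99, lam=0.95):
    """Two-level advantage estimation (a sketch of the idea in the summary).

    turn_values: value of the last token of each turn, length T.
    token_values_per_turn: one per-token value array per turn.
    """
    T = len(turn_values)
    # Macro level: the scalar episode reward arrives only at the final turn.
    turn_rewards = np.zeros(T)
    turn_rewards[-1] = episode_reward
    turn_adv = gae(turn_rewards, np.append(turn_values, 0.0), gamma, lam)

    # Micro level: each turn's advantage acts as a pseudo-reward for that
    # turn, credited at its last token; token-level GAE then distributes
    # credit within the turn.
    token_advs = []
    for adv, v_tok in zip(turn_adv, token_values_per_turn):
        tok_rewards = np.zeros(len(v_tok))
        tok_rewards[-1] = adv
        token_advs.append(gae(tok_rewards, np.append(v_tok, 0.0), gamma, lam))
    return turn_adv, token_advs
```

The variance reduction comes from the macro level: the episode reward is propagated across only T turn-steps rather than thousands of tokens, and each token sees GAE only over its own short turn.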

Experiments are conducted in two domains. (1) SOTOPIA, a social‑interaction simulator where two agents negotiate with opposing goals (e.g., buyer vs. seller). The authors start from the Qwen‑2.5‑7B model, fine‑tune it on a small set of human‑written seed dialogues, and then train with OMAR on ~3,200 conversations (500 test). Evaluation is performed by a GPT‑5‑Chat “LLM‑as‑Judge” that scores seven criteria: goal completion, believability, knowledge, secret leakage, rule compliance, relationship, and financial benefit. (2) Werewolf, a classic hidden‑role game that mixes cooperation and deception. Baselines include the untrained Qwen‑2.5‑7B model and a prior SOTOPIA‑RL model that uses utterance‑level reward models.

Results show that OMAR consistently outperforms both baselines across all SOTOPIA metrics. Improvements are especially pronounced in rule compliance and relationship scores, indicating that the model learns to maintain long‑term social contracts rather than merely optimizing immediate gains. In the Werewolf setting, the trained agents display emergent behaviors such as persuasion, strategic lying, and timely alliance formation, despite receiving only a win/loss reward at episode end. The paper also discusses reward‑hacking phenomena, which the authors mitigate with a turn‑level quality filter that discards implausibly high‑reward trajectories before PPO updates.
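One plausible form of such a filter is sketched below. The z-score criterion and threshold are assumptions for illustration; the summary does not specify the exact rule the authors use.

```python
# Illustrative trajectory filter (not the authors' exact criterion):
# drop rollouts whose judged reward sits implausibly far above the batch
# mean before they enter the PPO update.
def filter_trajectories(trajectories, z_threshold=3.0):
    """`trajectories` is a list of (rollout, reward) pairs; rewards more
    than `z_threshold` standard deviations above the batch mean are
    treated as likely reward hacking and discarded."""
    rewards = [r for _, r in trajectories]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    cutoff = mean + z_threshold * var ** 0.5
    return [(traj, r) for traj, r in trajectories if r <= cutoff]
```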

Key contributions are: (1) a unified single‑model architecture for multi‑agent conversational self‑play, (2) a hierarchical advantage estimator that stabilizes PPO over long dialogues, and (3) empirical evidence that rich social intelligence (empathy, compromise, negotiation) can emerge without explicit human supervision. The work opens avenues for scaling socially aware AI to larger groups, mixed human‑AI settings, and more sophisticated reward‑design automation.

