MOA: Multi-Objective Alignment for Role-Playing Agents
Role-playing agents (RPAs) must simultaneously master many conflicting skills – following multi-turn instructions, exhibiting domain knowledge, and adopting a consistent linguistic style. Existing work either relies on supervised fine-tuning (SFT), which overfits surface cues and yields low diversity, or applies reinforcement learning (RL) that fails to optimize the multiple dimensions needed for comprehensive RPAs. We present MOA (Multi-Objective Alignment), a reinforcement-learning framework that enables multi-dimensional, fine-grained rubric optimization for general RPAs. MOA introduces a novel multi-objective optimization strategy that trains simultaneously on multiple fine-grained rubrics to boost optimization performance. In addition, to improve the diversity and quality of model outputs, MOA employs thought-augmented rollouts with off-policy guidance. Extensive experiments on challenging benchmarks such as PersonaGym and RoleMRC show that MOA enables an 8B model to match or even outperform strong baselines such as GPT-4o and Claude across numerous dimensions. This demonstrates the great potential of MOA in building RPAs that can simultaneously meet the demands of role knowledge, persona style, diverse scenarios, and complex multi-turn conversations.
💡 Research Summary
The paper “MOA: Multi-Objective Alignment for Role-Playing Agents” addresses a fundamental challenge in developing advanced Role-Playing Agents (RPAs). RPAs must excel in multiple, often conflicting dimensions simultaneously, such as following multi-turn instructions, demonstrating accurate domain knowledge, and maintaining a consistent persona-specific linguistic style. The authors identify limitations in prevailing methods: Supervised Fine-Tuning (SFT) tends to overfit superficial patterns and lacks output diversity, while standard Reinforcement Learning (RL) approaches, designed for tasks with single verifiable rewards, fail to optimize the multifaceted nature of role-playing.
To overcome these issues, the authors propose MOA (Multi-Objective Alignment), a novel RL framework specifically tailored for comprehensive RPA optimization. MOA’s innovation lies in two core components:
- Multi-Objective Optimization Strategy: Instead of collapsing multiple evaluation aspects into a single scalar reward, MOA operates directly on a vector of fine-grained rubric scores (e.g., for knowledge, style, engagement). Its key algorithmic contributions are:
- Pivot Dimension Selection: MOA dynamically identifies the “pivot” dimension—the reward aspect showing the strongest current improvement trend—by analyzing short-term reward histories. This focuses learning efforts on the most learnable objective at each step.
- Conflict Rollout Elimination: For the chosen pivot dimension, MOA filters out "conflicting" rollout samples that score poorly on the pivot but well on other dimensions. This prevents samples with high overall scores but poor pivot scores from being treated as positive examples during the policy update, reducing learning noise and enabling more stable optimization of each individual dimension. This is a significant departure from collapsing all rubrics into a single weighted-sum reward, as in standard GRPO-style training.
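The two mechanisms above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact algorithm: the trend estimator (a least-squares slope over a short reward history) and the two thresholds are assumptions for the sake of the example.

```python
import statistics


def select_pivot(reward_history, window=4):
    """Pick the rubric dimension with the strongest recent improvement trend.

    reward_history: dict mapping dimension name -> list of per-step mean
    rewards (most recent last). The trend is estimated as the slope of a
    least-squares line over the last `window` steps (an assumed estimator).
    """
    def slope(values):
        xs = list(range(len(values)))
        x_mean = statistics.mean(xs)
        y_mean = statistics.mean(values)
        num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
        den = sum((x - x_mean) ** 2 for x in xs) or 1.0
        return num / den

    return max(reward_history, key=lambda d: slope(reward_history[d][-window:]))


def drop_conflict_rollouts(rollouts, pivot, pivot_thresh=0.5, other_thresh=0.7):
    """Filter out rollouts that score poorly on the pivot dimension but well
    on the others, so they are not treated as positives when updating on the
    pivot. Each rollout is a dict of dimension -> score in [0, 1]; the
    thresholds are illustrative hyperparameters."""
    kept = []
    for scores in rollouts:
        others = [v for d, v in scores.items() if d != pivot]
        conflicting = (scores[pivot] < pivot_thresh
                       and statistics.mean(others) > other_thresh)
        if not conflicting:
            kept.append(scores)
    return kept
```

For example, if the "style" reward has been rising while "knowledge" is flat, `select_pivot` returns `"style"`, and a rollout with high knowledge but low style is then excluded from the positive set.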
- Diversified Rollout Strategy: To combat the low sample diversity and quality typical of SFT-tuned policies, MOA employs:
- Thought-Augmented Rollout: The policy model is prompted to generate an internal reasoning trace (a “thought”) reflecting the persona’s emotions, background, and goals before producing the final response. This simple prompting technique was shown to improve performance even in closed-source models like Claude-3.7.
- Off-Policy Guidance: Outputs from a stronger, off-policy model (e.g., GPT-4) are mixed into the rollout pool used for advantage estimation. This elevates the quality and diversity of training samples and helps stabilize training against reward hacking.
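A minimal sketch of how these two pieces might fit together. The prompt template and the group-normalized advantage computation are assumptions made for illustration; the paper's exact prompt wording, mixing ratio, and normalization may differ.

```python
import statistics

# Hypothetical thought-augmented prompt: the policy first writes a
# persona-grounded reasoning trace, then the visible reply.
THOUGHT_TEMPLATE = (
    "You are {persona}.\n"
    "First, inside <thought>...</thought>, reflect on your character's "
    "emotions, background, and goals in this conversation. "
    "Then write your in-character reply.\n"
    "User: {user_message}"
)


def mixed_group_advantages(on_policy_rewards, off_policy_rewards):
    """Mix rewards of off-policy rollouts (from a stronger model) into the
    group used for advantage estimation, GRPO-style: each advantage is the
    reward standardized against the mean and std of the combined pool."""
    pool = list(on_policy_rewards) + list(off_policy_rewards)
    mean = statistics.mean(pool)
    std = statistics.pstdev(pool) or 1.0
    return [(r - mean) / std for r in pool]
```

Because the stronger model's rollouts typically earn higher rubric scores, they raise the group baseline and receive positive advantages, pulling the policy toward higher-quality, more diverse behavior.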
The paper provides a theoretical sketch showing that the pivot selection mechanism yields greater expected reward improvement per update compared to uniform weighting.
Extensive experiments on the challenging PersonaGym and RoleMRC benchmarks validate MOA’s effectiveness. It consistently outperforms both SFT and standard RL baselines across all evaluation metrics and model sizes (1.7B to 8B parameters). The most striking result is that MOA, applied to an 8B parameter model, achieves performance comparable to or even surpassing powerful proprietary baselines like GPT-4o and Claude on PersonaGym, and outperforms GPT-4o by 21.0% on RoleMRC. This demonstrates MOA’s potential as a scalable framework for building capable RPAs without relying on excessively large model scales or labor-intensive data curation pipelines.
In summary, MOA presents a principled RL solution to the multi-dimensional optimization problem inherent in role-playing, moving beyond the limitations of SFT and single-objective RL by introducing dynamic objective selection and conflict-aware sample filtering, paired with techniques to ensure high-quality, diverse exploration.