Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning
We present Collaborative Reward Modeling (CRM), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain-specific agents that each produce partial signals, alongside global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors such as step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. We evaluate CRM on RewardBench, a benchmark suite aligned with multi-dimensional preference evaluation, demonstrating a practical, modular path to more transparent reward modeling and more stable optimization.
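The aggregation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the evaluator names, the variance-based agreement term, and all weights are assumptions chosen to show how per-agent partial signals could be fused into one scalar reward.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvaluatorSignal:
    """Partial reward from one specialist evaluator (hypothetical interface)."""
    name: str
    score: float  # assumed to lie in [0, 1]

def aggregate(signals: List[EvaluatorSignal],
              repetition_penalty: float = 0.0,
              agreement_weight: float = 0.1) -> float:
    """Fuse specialist signals into a single training reward.

    Combines the mean evaluator score with a bonus for multi-agent
    agreement (low variance across evaluators) and subtracts a
    repetition penalty, mirroring the factors named in the abstract.
    """
    scores = [s.score for s in signals]
    mean = sum(scores) / len(scores)
    variance = sum((x - mean) ** 2 for x in scores) / len(scores)
    agreement = 1.0 - variance  # high when evaluators concur
    return mean + agreement_weight * agreement - repetition_penalty

# Example: three hypothetical specialist evaluators scoring one step.
signals = [EvaluatorSignal("factuality", 0.9),
           EvaluatorSignal("helpfulness", 0.8),
           EvaluatorSignal("safety", 1.0)]
reward = aggregate(signals, repetition_penalty=0.05)
```

The resulting scalar can then be fed to any standard RL pipeline (e.g., as the reward target that the value model regresses to under GAE-based updates).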