Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems


Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents’ reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent’s own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6% avg@16 and +4.6% pass@16 on math, and +15.2% avg@16 and +13.1% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.


💡 Research Summary

The paper tackles a pressing challenge in the emerging field of multi‑agent large language model (LLM) systems: how to apply reinforcement‑learning (RL) post‑training in a stable manner when multiple specialized agents cooperate on a single task. While single‑agent RL methods such as Group Relative Policy Optimization (GRPO) have shown strong performance, extending them to a multi‑agent setting introduces severe instability. The authors first formalize the multi‑agent LLM scenario, where K agents πθk interact sequentially, each generating text conditioned on a shared state. A trajectory τ consists of (state, action, active‑agent) tuples and yields a scalar terminal reward R(τ).
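The formalization above can be sketched as a pair of minimal data structures; the field and class names below are ours, not the paper's:

```python
from dataclasses import dataclass

@dataclass
class Step:
    state: str       # shared state visible to the active agent
    action: str      # text generated by the active agent
    agent_id: int    # which of the K agents pi_theta_k acted

@dataclass
class Trajectory:
    steps: list[Step]   # sequence of (state, action, active-agent) tuples
    reward: float       # scalar terminal reward R(tau)
```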

GRPO aggregates N rollouts for the same prompt, computes a global mean µ and standard deviation σ of the rewards, and normalizes each trajectory’s advantage as (R − µ)/σ. This global baseline is then applied uniformly to all agents, regardless of how often each agent is invoked or what reward distribution it experiences. The paper demonstrates, both theoretically and empirically, that this mismatch is the root cause of gradient‑norm explosion.
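A minimal sketch of this global normalization step, assuming one scalar terminal reward per rollout (the function name and shapes are illustrative, not the paper's code):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Global GRPO-style normalization: one baseline for all rollouts.

    rewards: N terminal rewards for rollouts of the same prompt.
    The resulting (R - mu) / sigma advantage is applied uniformly to
    every agent's tokens in a trajectory, regardless of which agent
    produced them.
    """
    rewards = np.asarray(rewards, dtype=float)
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + eps)
```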

Through a rigorous analysis, the authors derive Lemma 4.2, which decomposes the second moment of an agent’s gradient into a dominant scaling factor (σk² + (µk − µ)²)/σ² multiplied by the expected squared score‑function norm, plus a smaller covariance correction Δk. Here µk and σk² are the mean and variance of rewards observed when agent k is active. When an agent’s reward distribution is far from the global statistics—either because its mean deviates substantially from µ or its variance is much larger—the scaling factor inflates, causing the gradient norm to grow linearly and potentially diverge (Proposition 4.3). This phenomenon explains the frequent “gradient spikes” observed in naïve multi‑agent GRPO training.
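The effect of the dominant scaling factor can be checked numerically; the per-agent statistics below are invented purely for illustration:

```python
def grad_scale(mu_k, sigma_k, mu, sigma):
    """Dominant scaling factor from Lemma 4.2:
    (sigma_k^2 + (mu_k - mu)^2) / sigma^2."""
    return (sigma_k**2 + (mu_k - mu)**2) / sigma**2

# Invented global reward statistics across all rollouts.
mu, sigma = 0.5, 0.2

# Agent whose rewards match the global distribution: factor stays near 1.
matched = grad_scale(0.5, 0.2, mu, sigma)

# Agent whose mean deviates from the global mean: factor inflates (~5x),
# inflating that agent's gradient second moment proportionally.
deviating = grad_scale(0.9, 0.2, mu, sigma)
```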

To remedy the problem, the authors propose Dr. MAS, a simple yet theoretically grounded modification: each agent computes its own advantage using its own reward statistics, i.e., (R − µk)/σk. By normalizing per‑agent, the scaling factor is forced close to one, dramatically reducing gradient variance and eliminating spikes. Importantly, this change requires only a minor alteration to the existing GRPO pipeline and allows each agent to retain independent optimizer hyper‑parameters (learning rate, clipping epsilon, etc.).
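A sketch of the per-agent remedy, again with invented function and variable names; the paper applies this normalization inside a GRPO-style token-level objective rather than on flat reward arrays:

```python
import numpy as np

def dr_mas_advantages(rewards, agent_ids, eps=1e-8):
    """Per-agent normalization in the spirit of Dr. MAS: each agent's
    segments are normalized with that agent's own reward statistics,
    (R - mu_k) / sigma_k, instead of the global (mu, sigma).

    rewards:   terminal reward credited to each (trajectory, agent) segment
    agent_ids: which agent produced each segment
    """
    rewards = np.asarray(rewards, dtype=float)
    agent_ids = np.asarray(agent_ids)
    adv = np.empty_like(rewards)
    for k in np.unique(agent_ids):
        mask = agent_ids == k
        mu_k, sigma_k = rewards[mask].mean(), rewards[mask].std()
        adv[mask] = (rewards[mask] - mu_k) / (sigma_k + eps)
    return adv
```

With per-agent statistics, an agent whose rewards live on a very different scale still receives advantages of comparable magnitude, which is exactly what drives the Lemma 4.2 scaling factor toward one.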

Beyond the algorithm, the paper introduces a full‑stack RL training framework tailored for multi‑agent LLMs. An orchestrator manages distributed rollouts, mapping agents to worker groups that may share or separate LLM back‑ends (e.g., a 14B planner and a 7B executor). A shared resource pool schedules inference across GPUs, enabling efficient utilization even when agents have heterogeneous model sizes. The system supports per‑agent configuration files, dynamic role assignment, and flexible batching, addressing the practical bottlenecks of scaling RL to many agents.
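A per-agent serving and optimization config of this kind might look like the following sketch; every key, value, and pool name here is hypothetical, not the framework's actual schema:

```python
# Hypothetical per-agent configuration: two agents with heterogeneous
# model sizes, independent optimizer settings, and a shared actor backend.
agent_configs = {
    "planner": {
        "model": "Qwen2.5-14B",   # larger model for high-level planning
        "lr": 1e-6,               # per-agent learning rate
        "clip_eps": 0.2,          # per-agent PPO-style clipping epsilon
        "backend": "pool-a",      # shared LLM actor resource pool
    },
    "executor": {
        "model": "Qwen2.5-7B",    # smaller model for low-level execution
        "lr": 2e-6,
        "clip_eps": 0.2,
        "backend": "pool-a",      # may share GPUs with the planner
    },
}
```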

Empirical evaluation uses Qwen2.5 and Qwen3 series models (7B and 14B) on two benchmark suites: (1) role‑specialized multi‑agent math reasoning, where separate agents handle problem understanding, step‑by‑step solution, and final answer synthesis; (2) multi‑turn search, where agents perform query analysis, document retrieval, and answer summarization. Dr. MAS consistently outperforms vanilla GRPO. On the math benchmark, avg@16 improves by +5.6% and pass@16 by +4.6%; on the search benchmark, avg@16 gains +15.2% and pass@16 +13.1%. Gradient‑norm plots show that the spikes present in GRPO are virtually eliminated under Dr. MAS. Moreover, the framework remains effective when agents are assigned heterogeneous models (e.g., a larger model for high‑level planning and a smaller one for low‑level execution), achieving higher overall throughput and lower GPU memory consumption.

In summary, the paper makes three key contributions: (1) a formal theoretical analysis pinpointing global advantage normalization as the cause of gradient‑norm inflation in multi‑agent RL; (2) the Dr. MAS algorithm, which stabilizes training via per‑agent advantage normalization with provable reduction in gradient variance; and (3) an end‑to‑end system that orchestrates scalable, resource‑aware RL for heterogeneous multi‑agent LLM deployments. The work not only advances the state of the art in RL‑based LLM fine‑tuning but also provides a practical foundation for future complex, collaborative AI systems such as autonomous software engineers, multi‑modal assistants, and large‑scale decision‑making agents.

