Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs


Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it raises accuracy from a single-agent RL baseline of 14.0 to 47.0 percent up to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.


💡 Research Summary

The paper tackles the under‑explored problem of applying on‑policy reinforcement learning (RL) to multi‑agent systems (MAS) built on large language models (LLMs). While both MAS (which orchestrates role‑based collaboration) and RL (which improves policies via environmental rewards) have individually boosted LLM agent performance, their combination faces two major obstacles. First, standard group‑relative policy optimization (GRPO) assumes that multiple sampled actions share the same prompt, an assumption that breaks in MAS because prompts differ across roles and turns. Second, existing RL pipelines are designed for a single model, making it difficult to roll out, coordinate, and update several policies simultaneously.

To address these issues the authors introduce AT‑GRPO (Agent‑Turn Grouped Policy Optimization) together with a training system that can handle both single‑policy (role‑sharing) and multi‑policy (role‑specialized) regimes. The core algorithmic innovations are:

  1. Agent‑ and Turn‑wise Grouping – For each agent i at turn t, K candidate actions are sampled from the current policy. Because all K candidates share the exact same role‑specific prompt and interaction history, they form a valid comparison group. The relative advantage for each candidate is computed by mean‑centering and normalizing the reward within this group, exactly as in GRPO, preserving its variance‑reduction benefits.

  2. Tree‑Structured Sampling – After computing advantages, the candidate with the highest reward is selected as the actual action a*_i,t. This action updates the environment state, and the process repeats at the next turn, creating a branching “tree” of trajectories. The tree ensures that at every turn each agent still has a full group of K comparable candidates, avoiding the degenerate case of group size = 1 that occurs with naïve parallel sampling.

  3. Multi‑Policy Support – The system maintains a mapping σ from agents to LLM instances. In the role‑sharing regime (M = 1) all agents share a single policy θ₁, and a single joint minibatch is formed from the union of all agents’ data. In the role‑specialized regime (M = N) each agent i has its own policy θ_i, and updates are performed independently on per‑agent batches. This flexibility lets practitioners decide, based on task characteristics, whether specialization or sharing yields better performance.
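The grouping and selection steps above can be sketched in a few lines. This is a minimal NumPy illustration (function and variable names are hypothetical, not the paper's code): candidates that share an (environment, agent, turn) key are mean-centered and normalized together, exactly as in GRPO, and the highest-reward candidate in each group is the one executed to advance the environment.

```python
import numpy as np

def grouped_advantages(rewards, group_keys, eps=1e-8):
    """Mean-center and normalize rewards within each (env, agent, turn)
    group, preserving GRPO's variance-reduction benefits."""
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = np.zeros_like(rewards)
    for key in set(group_keys):
        idx = [i for i, k in enumerate(group_keys) if k == key]
        r = rewards[idx]
        adv[idx] = (r - r.mean()) / (r.std() + eps)
    return adv

# K = 3 candidates each for agent 0 and agent 1 at turn 0 of environment 0.
rewards = [1.0, 0.0, 0.5, 2.0, 2.0, 2.0]
keys = [(0, 0, 0)] * 3 + [(0, 1, 0)] * 3
adv = grouped_advantages(rewards, keys)

# Tree-structured sampling then executes the highest-reward candidate in
# each group, so the next turn again starts from a single shared state
# and keeps a full group of K comparable candidates.
best_idx = int(np.argmax(rewards[:3]))  # candidate chosen for agent 0
```

Note that a degenerate all-equal group (like agent 1's here) yields zero advantage for every candidate, so it contributes no gradient rather than a noisy one.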

The training loop consists of two phases. In the rollout phase, E environments are instantiated in parallel; for each turn the algorithm samples K actions per agent, computes rewards, builds group keys (environment, agent, turn), calculates advantages, stores the tuple (observation, actions, advantages) in per‑agent datasets D_i, and executes the best action. In the update phase, for each model m the corresponding batch B_m is assembled, the PPO‑style clipped loss (Eq. 2) is evaluated using the stored advantages, and θ(m) is updated via gradient descent.
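Eq. 2 itself is not reproduced in this summary, but it is described as a PPO-style clipped loss evaluated with the advantages stored during rollout. A minimal sketch under that assumption (names hypothetical):

```python
import numpy as np

def clipped_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate objective, evaluated with the
    group-relative advantages stored during the rollout phase."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages, dtype=np.float64)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return float(-np.minimum(unclipped, clipped).mean())

# At the first update step the new and old policies coincide (ratio = 1),
# so the loss reduces to the negated mean advantage of the batch.
loss = clipped_policy_loss([-1.2, -0.7], [-1.2, -0.7], [1.0, -1.0])
```

In the role-sharing regime this loss is taken over one joint minibatch pooled from all agents; in the role-specialized regime it is evaluated separately on each per-agent batch B_m before updating θ(m).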

Experimental Setup – The authors evaluate AT‑GRPO on Qwen3‑1.7B and Qwen3‑8B models across four domains: (i) long‑horizon planning (e.g., Sokoban), (ii) coding (LiveCodeBench), (iii) mathematical reasoning, and (iv) game‑style simulations. Baselines include single‑agent GRPO, prompt‑only MAS (both role‑sharing and role‑specialized), and the recent MAS‑RL baseline MARFT.

Results – AT‑GRPO delivers dramatic gains. In long‑horizon planning, accuracy jumps from a 14–47 % range for single‑agent GRPO to 96–99.5 % with AT‑GRPO. In coding, average improvements of 3.87–7.62 percentage points are observed, and in math tasks gains of 9.0–17.93 percentage points are reported. Notably, on the Sokoban benchmark with the 8B model, AT‑GRPO outperforms the baseline by 84 percentage points of absolute accuracy. The analysis shows that role‑specialized policies excel in workflows that involve tight coder‑tester loops, whereas role‑sharing policies are sufficient for simpler planning tasks.

Ablation and Analysis – The paper demonstrates that the agent‑turn grouping dramatically reduces advantage variance, leading to stable learning curves. The tree‑sampling strategy is essential; without it, groups shrink to size 1 after the first turn, causing unstable updates. Credit assignment is handled via reward masks that give reward only to response tokens, ensuring fair comparison across candidates.
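The reward-mask idea can be illustrated briefly. The sketch below (names hypothetical, not the paper's implementation) averages a per-token loss over response tokens only, so prompt tokens, which differ across roles and turns, contribute no gradient:

```python
import numpy as np

def masked_token_loss(per_token_loss, response_mask):
    """Average loss over response tokens only; masked-out prompt
    tokens contribute nothing to the objective."""
    loss = np.asarray(per_token_loss, dtype=np.float64)
    mask = np.asarray(response_mask, dtype=np.float64)
    return float((loss * mask).sum() / max(mask.sum(), 1.0))

# Two prompt tokens (masked out) followed by two response tokens.
avg = masked_token_loss([9.0, 9.0, 2.0, 4.0], [0, 0, 1, 1])
```

Because only response tokens are scored, candidates within a group remain comparable even though their role-specific prompts can be long and heterogeneous.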

Limitations – Experiments are limited to relatively small LLMs and a handful of domains; scalability to very large models or complex physical simulators remains untested. Tree‑sampling incurs O(K × T) computation, which can become costly for large K; more efficient candidate generation (e.g., meta‑learning or adaptive K) is an open direction. Reward design is handcrafted per domain, so generalizing to a universal reward function is not yet demonstrated.

Future Work – The authors suggest extending AT‑GRPO to multimodal, larger‑scale models, integrating adaptive sampling methods to reduce computational overhead, and developing automated reward‑shaping frameworks that can be plugged into the same training pipeline.

Conclusion – AT‑GRPO provides a principled, practical solution for on‑policy RL in multi‑agent LLM settings. By ensuring that each agent‑turn pair has a well‑defined comparison group and by supporting both shared and specialized policies, the method unlocks synergistic gains that single‑agent RL cannot achieve. The substantial empirical improvements across planning, coding, and math tasks indicate that AT‑GRPO could become a foundational technique for training collaborative LLM agents in increasingly complex real‑world applications.

