CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent’s policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.


💡 Research Summary

CoMAS (Co‑Evolving Multi‑Agent Systems) introduces a self‑evolution framework for LLM‑based agents that requires no external supervision. The core idea is to let agents learn from the rich dynamics of their own multi‑agent interactions. Each interaction follows a three‑step cycle: (1) a randomly selected agent proposes a solution to a given question, (2) another agent provides a critical evaluation of that solution, and (3) a third “judge” agent scores the solution–evaluation pair in a predefined format. The scoring output, constrained to an integer between 1 and 3, is parsed by an LLM‑as‑a‑judge module and then normalized to produce intrinsic rewards. The rewards are competitive and constant‑sum: the solver receives r(s) = (τ̂ − 1)/2 while the evaluator receives r(e) = (3 − τ̂)/2, where τ̂ is the extracted score, so a high score rewards the solver and a low score rewards the evaluator. This encourages both correctness and critical thinking. The judge agent receives a penalty of −1 for malformed output, which enforces stable formatting.
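The score extraction and reward mapping above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the tag format and function names (`extract_score`, `interaction_rewards`) are assumptions.

```python
# Sketch of the intrinsic-reward computation described above.
import re

def extract_score(judge_output: str):
    """Parse the judge's integer score (1-3) from its formatted output.

    Assumes a tag format like '<score>2</score>'; the actual format
    in the paper may differ. Returns None if the output is malformed.
    """
    match = re.search(r"<score>\s*([123])\s*</score>", judge_output)
    return int(match.group(1)) if match else None

def interaction_rewards(judge_output: str):
    """Map the judge's score tau to solver/evaluator/judge rewards."""
    tau = extract_score(judge_output)
    if tau is None:
        # Malformed judge output: penalize the judge; no learning
        # signal for the solver or the evaluator.
        return {"solver": 0.0, "evaluator": 0.0, "judge": -1.0}
    r_solver = (tau - 1) / 2     # tau=1 -> 0.0, tau=2 -> 0.5, tau=3 -> 1.0
    r_evaluator = (3 - tau) / 2  # tau=1 -> 1.0, tau=2 -> 0.5, tau=3 -> 0.0
    return {"solver": r_solver, "evaluator": r_evaluator, "judge": 0.0}
```

Note that the solver and evaluator rewards always sum to 1, which is what makes the interaction competitive: the evaluator only gains by successfully exposing flaws in the solution.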

Each agent maintains its own independently parameterized policy πθ, allowing heterogeneous LLM backbones to coexist. Policies generate token‑level outputs autoregressively and are updated via standard policy‑gradient RL algorithms (e.g., PPO) using the interaction‑derived rewards. Acting agents are selected uniformly at random, ensuring balanced experience collection across the pool in expectation. To keep context lengths manageable, the discussion history is truncated to the most recent κ rounds.
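A minimal sketch of one interaction round, showing the uniform role assignment and the κ‑round history truncation. The `run_round` helper, the `act` interface, and the role-sampling details are assumptions for illustration only; the policy-gradient update itself is omitted.

```python
# One solve/critique/judge cycle with a history capped at KAPPA rounds.
import random
from collections import deque

KAPPA = 4  # keep only the most recent KAPPA discussion rounds

def run_round(agents, question, history: deque):
    """Assign roles uniformly at random, run one cycle, log it to history."""
    solver, evaluator, judge = random.sample(agents, 3)  # uniform selection
    context = list(history)  # truncated discussion context seen by all roles
    solution = solver.act(question, context)
    critique = evaluator.act(solution, context)
    verdict = judge.act((solution, critique), context)
    history.append((solution, critique, verdict))
    return solver, evaluator, judge, verdict

# A deque with maxlen implements the KAPPA-round truncation automatically:
# appending beyond maxlen silently drops the oldest round.
history = deque(maxlen=KAPPA)
```

Using `deque(maxlen=KAPPA)` is one simple way to realize the truncation; rewards from each round would then feed the PPO-style update of the participating agents' policies.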

Experiments evaluate CoMAS across four interaction paradigms—Vanilla (single‑agent), Consistency (multiple agents answering the same question), AutoGen (agents generate and solve their own problems), and Debate (competitive discussion). Benchmarks span mathematics, coding, and general‑knowledge tasks. Across all settings, CoMAS outperforms untrained baselines by absolute margins of 2.2% to 19.8%, with the largest gains in the Debate configuration. Compared to prior RLVR approaches that rely on rule‑based verifiers or specialized reward models, CoMAS shows superior stability and avoids reward‑hacking issues because the reward signal is intrinsically tied to the interaction outcome.

Ablation studies reveal that removing the interaction‑based reward causes training to collapse, confirming the necessity of the competitive reward design. Scaling experiments demonstrate near‑linear performance improvements as the number of agents grows from 2 to 8, and heterogeneous mixtures of models (e.g., GPT‑3.5 with LLaMA‑2) still benefit from the shared learning signal.

Limitations include dependence on the reliability of the LLM‑as‑a‑judge component, the fixed three‑level scoring scheme, which may be too coarse for complex multi‑step problems, and the focus on text‑only tasks. Three directions are suggested for future work: (i) developing more robust and calibrated judge models, (ii) extending the reward taxonomy to capture confidence, creativity, or novelty, and (iii) applying the framework to multimodal or embodied environments where long‑term self‑directed learning is required.

In summary, CoMAS demonstrates that autonomous agents can achieve self‑evolution by extracting intrinsic rewards from peer‑to‑peer discussions, mirroring the collaborative learning processes observed in human societies. This decentralized, scalable approach opens a promising avenue for continual improvement of LLM‑based agents without external reward engineering.

