MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems


Multi-Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates the intermediate steps of a trajectory, has shown promise in general reasoning settings and has been suggested as a tool for guiding MAS coordination; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS-ProVe, a systematic empirical study of process verification for MAS. Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models), evaluated at two levels of verification granularity (agent-level and iteration-level). We further examine five representative verifiers and four context-management strategies, and conduct experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process-level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi-agent trajectories. Among the methods studied, LLM-as-a-Judge generally outperforms reward-based approaches, with trained judges surpassing general-purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context-length-performance trade-off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring advances beyond current paradigms. Code is available at https://github.com/Wang-ML-Lab/MAS-ProVe.


💡 Research Summary

The paper presents MAS‑ProVe, a comprehensive empirical study of process verification for large‑language‑model (LLM) based multi‑agent systems (MAS). Process verification evaluates intermediate steps of a reasoning trajectory, aiming to catch errors before they propagate to the final answer. The authors explore three verification paradigms—LLM‑as‑a‑Judge (generative), reward models (RM), and process reward models (PRM)—and two granularity levels—agent‑level (verifying each sub‑agent’s output) and iteration‑level (verifying the whole MAS state after each round). Five representative verifiers (general‑purpose LLM, fine‑tuned judge, Skywork‑Reward‑V2, Qwen2.5‑Math‑PRM, and a weaker GPT‑4o‑Mini) and four context‑management strategies (current step only, summarized context + current step, summarized full context, raw full context) are combined with six diverse MAS frameworks (Debate, AFlow, ADAS, DyLAN, MaAS, MAS‑Zero). Experiments cover two reasoning domains: mathematical problem solving (AIME‑24, AIME‑25) and real‑world assistant tasks (GAIA information‑extraction subset). All MAS use GPT‑5‑Mini as the backbone, and a greedy best‑first search with three candidate continuations per state is employed to let the verifier guide the next step.
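The verifier-guided search described above can be sketched as a greedy best-first loop: at each step the MAS proposes a few candidate continuations, and the process verifier's score decides which one to keep. This is a minimal illustration, not the paper's implementation; the function names (`propose_continuations`, `score_step`) and the state representation are placeholders.

```python
def run_with_process_verification(initial_state, propose_continuations,
                                  score_step, max_steps=10, branching=3):
    """Greedy best-first rollout guided by a process verifier.

    At each step, `propose_continuations(state, n)` yields candidate next
    states (e.g., agent outputs or iteration results), and `score_step`
    rates each candidate given the partial trajectory so far. Only the
    top-scoring candidate is kept; the rest are discarded (greedy search,
    mirroring the three-candidate setup described in the study).
    """
    state = initial_state
    trajectory = [state]
    for _ in range(max_steps):
        candidates = propose_continuations(state, n=branching)
        if not candidates:
            break  # the MAS has no further continuations; stop
        # The process verifier scores each partial-trajectory extension.
        state = max(candidates, key=lambda c: score_step(trajectory, c))
        trajectory.append(state)
    return trajectory
```

In practice `score_step` would wrap one of the studied verifiers (a judge prompt, a reward model, or a PRM), and its input would be built by one of the context-management strategies.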

Key findings:

1) Process verification does not uniformly improve MAS performance; variance is high, especially for reward‑based verifiers.
2) LLM‑as‑a‑Judge consistently outperforms scoring‑based approaches, with fine‑tuned judges (e.g., FARE‑20B) surpassing general‑purpose LLMs.
3) No single granularity dominates across all MAS; individual systems show clear preferences (fixed‑architecture MAS tend to benefit slightly from agent‑level verification, while adaptive‑architecture MAS favor iteration‑level).
4) Context length matters: providing raw full histories hurts performance due to token limits, while summarization (either of prior context or of context + current step) yields the best trade‑off between information richness and verifier reliability.
5) When broken down by problem difficulty, verification improves stability and modestly raises accuracy on easy/medium tasks but rarely rescues fundamentally unsolvable hard instances, indicating that verification can only guide the search when a correct solution lies within the model’s generation horizon.
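The four context-management strategies compared in the study can be illustrated as simple prompt-assembly variants. This is a hedged sketch: `summarize` here merely truncates, standing in for the LLM summarizer a real pipeline would call, and `build_verifier_context` is an illustrative helper, not the paper's API.

```python
def summarize(text, max_chars=200):
    """Placeholder summarizer: truncation stands in for an LLM summary."""
    return text if len(text) <= max_chars else text[:max_chars] + "..."

def build_verifier_context(history, current_step, strategy):
    """Assemble the verifier's input under one of four strategies.

    history      -- list of prior MAS steps (strings)
    current_step -- the step being verified
    strategy     -- one of the four options studied
    """
    prior = "\n".join(history)
    if strategy == "current_only":
        # Current step only: cheapest, but the verifier sees no history.
        return current_step
    if strategy == "summary_plus_current":
        # Summarized prior context + the raw current step.
        return summarize(prior) + "\n" + current_step
    if strategy == "summary_full":
        # Summarize the full context, current step included.
        return summarize(prior + "\n" + current_step)
    if strategy == "raw_full":
        # Raw full context: most informative, but can exceed token limits.
        return prior + "\n" + current_step
    raise ValueError(f"unknown strategy: {strategy}")
```

The study's finding that summarization-based strategies work best corresponds to the middle two branches: they preserve the gist of the trajectory without overwhelming the verifier's context window.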

Overall, the study concludes that current process verification paradigms are insufficient for robust, consistent gains in MAS. Reward models trained on single‑agent trajectories struggle with the out‑of‑distribution nature of multi‑agent partial states. Generative judges are more naturally suited to MAS test‑time scaling, yet they still face challenges related to computational cost, context management, and signal reliability. The authors suggest future work should focus on (a) training MAS‑specific reward models at scale, (b) developing dynamic context summarization techniques, and (c) designing tighter integration mechanisms between verification and generation to achieve more cost‑effective and dependable process verification for multi‑agent systems.

