On the Uncertainty of Large Language Model-Based Multi-Agent Systems


Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of uncertainty, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies and six benchmark tasks. By analyzing 245 features spanning token-, trajectory-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that uncertainty dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: reducing uncertainty at any stage for any agent is critical for guaranteeing correct solutions; 2) Base Uncertainty: base models with lower entropy during problem-solving directly benefit MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS’s pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at https://github.com/AgenticFinLab/multiagent-entropy.


💡 Research Summary

This paper investigates the role of uncertainty—measured as entropy—in large‑language‑model (LLM) based multi‑agent systems (MAS). While MAS have been touted as a powerful paradigm for solving complex tasks, the mechanisms that determine when they succeed or fail, especially when built on publicly available open‑source LLMs, have not been systematically studied. The authors revisit MAS from an uncertainty perspective, tracking entropy at token, agent, round, sample, and system levels across four coordination topologies (Sequential, Centralized, Debate, Hybrid) and six benchmark tasks (GSM8K, MATH500, AIME2024, AIME2025, HumanEval, MMLU).

Methodology

  • Models: Five open‑source LLMs (LLaMA‑3B, LLaMA‑8B, Qwen‑0.6B, Qwen‑4B, Qwen‑8B) are used as the shared base policy.
  • MAS Configurations: All systems run for two interaction rounds (R = 2). The four topologies differ in how agents exchange messages and how the final answer is selected (single‑agent output, orchestrator aggregation, majority voting, etc.).
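One of the final-answer selection mechanisms mentioned above, majority voting, can be sketched in a few lines. This is a generic illustration, not the paper's exact implementation; the function name and tie-breaking rule are assumptions.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most common final answer among the agents.

    Ties resolve to the earliest agent's answer, since Counter preserves
    first-seen insertion order among equal counts (Python >= 3.7).
    """
    return Counter(answers).most_common(1)[0][0]
```

For example, `majority_vote(["A", "B", "A"])` returns `"A"`.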
  • Entropy Features: At each decoding step s, the token‑level entropy H(s) = −∑_{v∈V} π(v|s) log π(v|s) is logged, where π(·|s) is the model's next‑token distribution over the vocabulary V. From these logs the authors derive 245 quantitative features, grouped into:
    • Agent‑level statistics (156 features) – per‑agent mean, variance, inter‑agent divergence across rounds.
    • Round‑level dynamics (27 features) – average, max, and change of entropy between round 1 and round 2.
    • Sample‑level aggregates (29 features) – overall entropy statistics for a given problem instance.
    • System‑level comparisons (10 features) – differences in entropy patterns across topologies.
    • Base‑model entropy (17 features) – intrinsic uncertainty of the underlying LLM.
    • Computational metrics (15 features) – time, token usage, inference counts, and base‑model correctness.
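The per-step entropy and its round-level aggregates can be computed directly from the decoder's probability distributions. The sketch below is a minimal illustration of that pipeline; the function names and the three statistics shown are illustrative, not the paper's exact 245-feature schema.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> float:
    """Shannon entropy H(s) = -sum_v pi(v|s) * log pi(v|s) over the vocabulary.

    `probs` is the next-token probability distribution at one decoding step.
    """
    probs = probs[probs > 0]  # drop zero-probability tokens to avoid log(0)
    return float(-np.sum(probs * np.log(probs)))

def round_features(step_entropies: list[float]) -> dict:
    """Aggregate per-step token entropies into simple round-level statistics,
    in the spirit of features like 'first-round average token entropy'."""
    e = np.asarray(step_entropies)
    return {
        "mean_entropy": float(e.mean()),
        "max_entropy": float(e.max()),
        "entropy_var": float(e.var()),
    }
```

A uniform distribution over 4 tokens yields the maximum entropy log 4 ≈ 1.386, while a one-hot distribution yields 0, matching the intuition that lower entropy means a more certain agent.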

Key Findings

  1. Single‑Agent Superiority in Many Cases: Across 30 model‑task‑topology scenarios, a single‑agent system (SAS) achieves the highest accuracy in 13 cases (43.3 %). SAS also matches or exceeds MAS in an additional 13 cases, indicating that adding agents does not guarantee improvement, especially for smaller models and math‑heavy tasks.

  2. Early‑Round Entropy Dominates: Entropy measured in the first interaction round strongly predicts final correctness. SHAP analysis shows that “first‑round average token entropy” and “first‑round entropy variance” are the most important features; lower values correlate with higher success rates.

  3. Base‑Model Uncertainty Limits MAS Gains: The total token‑level entropy of the base LLM (a “Base‑E” feature) is a top predictor of MAS performance. When the underlying model is already uncertain (high entropy), the multi‑agent collaboration tends to amplify errors rather than mitigate them. Conversely, low‑entropy base models enable MAS to achieve noticeable accuracy lifts.

  4. Task‑Specific Entropy Dynamics: The importance of entropy reduction varies by task. For code generation (HumanEval), entropy drops in the second round are most beneficial, whereas for mathematical reasoning (GSM8K, MATH500) the critical reduction occurs already in round 1. This suggests that task‑aware scheduling of uncertainty‑reduction strategies could further improve MAS.

  5. Entropy Judger – A Simple Post‑Processing Selector: Using the 245 entropy‑derived features, the authors train an ensemble of XGBoost and LightGBM models (the “Entropy Judger”) to predict the probability that a given answer is correct. At inference time, for each problem the MAS produces a pass@k list of candidate solutions; the Entropy Judger selects the candidate with the highest predicted correctness. This approach consistently improves accuracy across all topologies and benchmarks, yielding gains of 1.2 %–4.5 % over the raw MAS outputs, with the largest improvements observed for the weakest base models.
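The selection logic of the Entropy Judger can be sketched as follows. Note the hedge: the paper trains an XGBoost + LightGBM ensemble over the 245 entropy features; here a tiny hand-rolled logistic regression stands in for that ensemble so the example stays self-contained, and all names are hypothetical.

```python
import numpy as np

def train_judger(X: np.ndarray, y: np.ndarray, lr: float = 0.5, epochs: int = 500) -> np.ndarray:
    """Fit logistic-regression weights by gradient descent on log-loss.

    X: (n_samples, n_features) entropy-derived features; y: 0/1 correctness labels.
    Stands in for the paper's gradient-boosted ensemble.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))       # predicted P(correct)
        w -= lr * Xb.T @ (p - y) / len(y)       # gradient step
    return w

def select_candidate(w: np.ndarray, candidates: np.ndarray) -> int:
    """Return the index of the pass@k candidate with the highest predicted
    probability of correctness, given each candidate's feature vector."""
    Cb = np.hstack([candidates, np.ones((len(candidates), 1))])
    return int(np.argmax(1.0 / (1.0 + np.exp(-Cb @ w))))
```

On toy data where low entropy correlates with correctness, the judger learns to prefer the low-entropy candidate, mirroring the paper's certainty-preference finding.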

Implications
The study provides empirical evidence for the “certainty preference” principle: reducing uncertainty at any stage for any agent is strongly associated with correct outcomes. It also highlights the “base uncertainty” effect: the intrinsic entropy of the underlying LLM sets an upper bound on how much a multi‑agent architecture can help. Finally, the “task awareness” observation underscores that entropy dynamics are not universal; they must be interpreted in the context of the specific problem domain.

Future Directions
The authors suggest several avenues for extending this work: (i) designing prompts or decoding strategies that explicitly target entropy reduction in early rounds; (ii) developing adaptive topologies that reconfigure communication patterns based on real‑time entropy signals; (iii) scaling the analysis to larger, commercial LLMs (e.g., GPT‑4, Claude) to verify whether the same patterns hold; and (iv) integrating entropy‑guided loss functions during fine‑tuning to produce base models with inherently lower uncertainty.

In summary, this paper establishes uncertainty—captured via hierarchical entropy metrics—as a central explanatory factor for the performance of LLM‑based multi‑agent systems, demonstrates that early‑round entropy dynamics largely determine success, and introduces a lightweight, entropy‑driven selection algorithm that reliably boosts MAS accuracy across diverse tasks and configurations.

