Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Multi-agent systems built on large language models (LLMs) are expected to enhance decision-making by pooling distributed information, yet systematically evaluating this capability has remained challenging. We introduce HiddenBench, a 65-task benchmark grounded in the Hidden Profile paradigm, which isolates collective reasoning under distributed information from individual reasoning ability. Evaluating 15 frontier LLMs, we find that multi-agent LLMs achieve only 30.1% accuracy under distributed information, compared to 80.7% accuracy for single agents given complete information. We trace this gap to a systematic failure mode: agents cannot recognize or act under latent information asymmetry. They fail to reason about what others might know but have not yet expressed, leading to premature convergence on shared evidence while critical distributed facts remain unexplored. These failures persist across prompting strategies, communication depths, and group sizes, and worsen as groups scale. While some models (e.g., Gemini-2.5-Flash/Pro) outperform others, neither model scale nor individual reasoning accuracy reliably predicts collective performance. Our results identify failures in collective information exploration as a key limitation of multi-agent LLM decision-making, and provide a theory-grounded, reproducible framework for diagnosing collective reasoning failures.


💡 Research Summary

This paper investigates the ability of large language model (LLM)‑based multi‑agent systems to pool distributed information and make better decisions than any single agent. To isolate collective reasoning from individual reasoning ability, the authors introduce HiddenBench, a reproducible benchmark of 65 tasks derived from the classic “Hidden Profile” paradigm in social psychology. In each task, a set of “shared” facts is given to all agents, while a complementary set of “unshared” facts is uniquely assigned to individual agents. The shared facts are deliberately constructed to support an incorrect option, whereas only the combination of all unshared facts points to the correct answer. Thus, success requires agents to recognize that they each hold latent, private information and to actively elicit it from one another.
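The shared/unshared split described above can be sketched as a small data structure. This is an illustrative reconstruction, not the paper's actual code; the class and method names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class HiddenProfileTask:
    """Illustrative sketch of one Hidden Profile task's information structure."""
    options: list[str]                    # candidate answers, e.g. ["A", "B"]
    correct: str                          # supported only by the union of unshared facts
    shared_facts: list[str]               # given to every agent; biased toward a wrong option
    unshared_facts: dict[str, list[str]]  # agent_id -> privately held facts

    def profile_for(self, agent_id: str) -> list[str]:
        """Facts visible to one agent under the Hidden Profile condition."""
        return self.shared_facts + self.unshared_facts.get(agent_id, [])

    def full_profile(self) -> list[str]:
        """Complete fact set, as given in the Full Profile control condition."""
        private = [f for facts in self.unshared_facts.values() for f in facts]
        return self.shared_facts + private
```

The key property is that `profile_for` returns a systematically biased slice of the evidence, while only `full_profile` (or a complete exchange of unshared facts) supports the correct option.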

The authors first validate the benchmark with human groups and with GPT‑4.1 agents, confirming that (i) under a "Full Profile" condition where every agent receives the complete fact set, both humans and GPT‑4.1 achieve high accuracy (≈73% for GPT‑4.1, ≈60% for humans), and (ii) under the "Hidden Profile" condition, pre‑discussion accuracy is near zero, and post‑discussion performance improves modestly yet remains far below the Full Profile ceiling. This demonstrates that the tasks truly require collective information integration and that failure cannot be blamed on task difficulty alone.

Using HiddenBench, the authors evaluate 15 state‑of‑the‑art LLMs (including Gemini‑2.5‑Flash/Pro, GPT‑4.1, Claude‑3, Llama‑2, etc.) in multi‑agent settings. Agents are organized in groups of 3–5 and allowed 2–4 rounds of dialogue. Three prompting strategies are tested: free‑form chat, structured Q&A, and meta‑prompts that explicitly ask agents to “search for missing information”. Across all configurations, multi‑agent systems achieve only 30.1% accuracy on Hidden Profile tasks, whereas a single agent given the Full Profile reaches 80.7% accuracy. The performance gap persists regardless of prompting style, communication depth, or model family, and it worsens as group size increases (larger groups converge earlier on the shared, misleading evidence).
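The round-robin discussion protocol described above can be sketched in a few lines. This is a simplified assumption about the interaction loop, not the paper's implementation; `ask_model` stands in for an LLM API call.

```python
def run_group_discussion(profiles, ask_model, rounds=3):
    """Minimal sketch of a free-form multi-agent discussion.

    profiles: dict mapping agent_id -> list of facts that agent holds
    ask_model: callable(agent_id, context) -> str, a stand-in for an LLM call
    """
    transcript = []
    for _ in range(rounds):
        for agent, facts in profiles.items():
            # each agent sees only its own facts plus the public transcript
            message = ask_model(agent, facts + transcript)
            transcript.append(f"{agent}: {message}")
    # after discussion, each agent casts a vote; the group answer is the majority
    votes = [ask_model(agent, facts + transcript + ["Vote now."])
             for agent, facts in profiles.items()]
    return max(set(votes), key=votes.count)
```

The design point the paper stresses is visible here: nothing in the loop forces an agent to ask what the others privately hold, so exploration of unshared facts depends entirely on the model's own initiative.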

Through ablation studies, the authors identify a systematic failure mode: agents do not reason about latent information asymmetry. Instead of treating other agents as potential holders of unknown facts, they adopt a myopic policy that maximizes immediate information gain from what is already observable. Consequently, they quickly converge on the shared, incorrect evidence and stop asking probing questions that would reveal the unshared facts. This behavior mirrors human biases such as shared‑information bias and premature consensus formation observed in classic Hidden Profile experiments.
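One simple way to quantify this failure mode (an illustrative diagnostic of my own, not a metric from the paper) is to measure what fraction of the unshared facts ever surface in the discussion transcript:

```python
def unshared_coverage(transcript, unshared_facts):
    """Fraction of unshared facts mentioned anywhere in the discussion.

    Low coverage is consistent with shared-information bias: the group
    converges on common evidence without surfacing private facts.
    """
    text = " ".join(transcript).lower()
    mentioned = sum(1 for fact in unshared_facts if fact.lower() in text)
    return mentioned / len(unshared_facts) if unshared_facts else 0.0
```

Substring matching is a crude proxy (a real diagnostic would need paraphrase-tolerant matching), but it makes the premature-convergence pattern measurable per transcript.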

Importantly, the study finds that neither model scale nor individual reasoning accuracy predicts collective performance. For example, Gemini‑2.5‑Flash/Pro, which excels on single‑agent benchmarks, only modestly outperforms smaller models in the multi‑agent Hidden Profile setting. This suggests that simply scaling LLMs will not solve the coordination problem; instead, new training objectives or interaction protocols that explicitly reward epistemic exploration are needed.

The authors propose a theoretical framing: under partial observability, optimal agents should maximize expected information gain about the hidden state, which can be formalized as a reinforcement‑learning objective. Current LLMs, trained to predict the next token, lack such long‑term, exploration‑oriented incentives. Therefore, they recommend (1) designing meta‑prompts that embed an “information‑seeking” goal, (2) fine‑tuning or RL‑training multi‑agent policies with a reward that reflects the reduction of uncertainty about the correct answer, and (3) dynamically allocating dialogue turns so that each agent gets at least one opportunity to contribute unique information.
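One way to make the "maximize expected information gain" framing concrete (a standard formalization, with notation chosen here rather than taken from the paper): let $h$ denote the hidden correct answer and $b_t$ the public transcript after turn $t$. An exploration-oriented agent would choose its next question $q$ to maximize the expected reduction in uncertainty about $h$:

```latex
q_t^{*} = \arg\max_{q} \; \Big[ H(h \mid b_t)
  \;-\; \mathbb{E}_{a \sim p(a \mid q,\, b_t)} \big[ H(h \mid b_t, q, a) \big] \Big]
```

where $a$ is the answer another agent would give and $H(\cdot)$ is Shannon entropy. The bracketed quantity is the mutual information $I(h; a \mid b_t, q)$, i.e., the expected information gain of asking $q$. Next-token prediction provides no pressure toward this objective, which is the gap the paper's RL-based recommendation aims to close.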

In summary, the paper makes three major contributions: (1) the HiddenBench benchmark, which cleanly separates collective reasoning failures from individual ability; (2) a comprehensive empirical finding that current multi‑agent LLM systems dramatically underperform single agents when information is distributed, across a wide range of models and settings; and (3) a diagnosis of the root cause—failure to recognize and act upon latent information asymmetry—and a roadmap for future work that emphasizes epistemic exploration, coordinated questioning, and reward‑driven policy learning. This work highlights a fundamental limitation of today’s multi‑agent LLMs and provides a concrete, reproducible platform for the community to develop and evaluate solutions.

