Roundtable Policy: Confidence-Weighted-Consensus Aggregation Improves Multi-Agent-System Reasoning
Multi-agent systems have demonstrated exceptional performance on downstream tasks, surpassing diverse single-agent baselines. A growing body of work has explored ways to improve their reasoning and collaboration, ranging from voting and debate to complex interaction protocols. However, it remains opaque why a specific design choice should be preferred in multi-agent systems. Inspired by the decision-making mechanisms of democratic committees and The Society of Mind, we introduce Roundtable Policy, an inference-time reasoning framework for multi-agent systems that performs inference through the weighted consensus of multiple LLMs. Through extensive experiments, we demonstrate that this approach significantly enhances reasoning on complex, heterogeneous scientific tasks. Roundtable Policy emphasizes structured and interpretable inference rather than opaque convergence, while requiring only black-box access and uniform procedures, making it broadly applicable to diverse multi-agent systems.
💡 Research Summary
The paper introduces “Roundtable Policy,” an inference‑time reasoning framework that aggregates the outputs of multiple large language models (LLMs) through a confidence‑weighted consensus mechanism. The authors observe that existing multi‑agent collaboration methods—simple majority voting, debate‑style dialogue, or generic ensembles—either treat all agents as equally trustworthy or produce opaque decision processes that are difficult to interpret and audit. Inspired by democratic committees and the Society of Mind, Roundtable Policy treats multi‑agent reasoning as a three‑phase committee process: (1) Proposition, where a set of “player” agents independently generate candidate answers to the same query; (2) Backward, where a set of “grader” agents evaluate each candidate using a rubric, producing a quality score (‑100 to 100) and a calibrated 95 % confidence interval that quantifies uncertainty; (3) Inference, where an “aggregator” agent synthesizes the candidates using a pre‑computed confidence‑weight table (ϑ).
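The three-phase protocol can be illustrated with a minimal sketch. The agent classes, their method names, and the argmax selection in the final phase are all assumptions for illustration; in the actual system the players and graders are black-box LLM APIs, and the Inference phase uses an aggregator agent to synthesize (rather than merely select among) the candidates.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for LLM agents; the real system would call
# black-box model APIs. Names, signatures, and scoring logic are illustrative.
@dataclass
class Player:
    name: str
    reply: str

    def answer(self, query: str) -> str:
        return self.reply


@dataclass
class Grader:
    bias: float

    def grade(self, query: str, answer: str):
        # Returns (rubric score in [-100, 100], 95% CI half-width).
        return (len(answer) % 100) - self.bias, 10.0


def roundtable_round(query, players, graders, theta, dim):
    # Phase 1 (Proposition): players independently propose answers.
    candidates = {p.name: p.answer(query) for p in players}
    # Phase 2 (Backward): each grader scores every candidate with a
    # rubric score and a calibrated confidence interval.
    evals = {n: [g.grade(query, a) for g in graders]
             for n, a in candidates.items()}
    # Phase 3 (Inference): weight each candidate by its player's entry in
    # the pre-computed confidence-weight table theta. Here we simply pick
    # the highest-weighted candidate; the paper's aggregator agent instead
    # synthesizes a final answer from all weighted candidates.
    weights = {n: theta.get((n, dim), 0.0) for n in candidates}
    best = max(candidates, key=lambda n: weights[n])
    return best, candidates[best], evals
```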
The confidence‑weight table is the core of the method. After each round of evaluation, the (score, uncertainty) pairs are averaged across graders and accumulated for each player across tasks, yielding a matrix ϑ ∈ ℝ^{L_p × N}. Here L_p is the number of player agents and N denotes task‑specific dimensions (e.g., sub‑tasks in ScienceEval or rubric dimensions in ScienceNarrative). This table functions as a structured memory of each agent’s historical reliability and uncertainty, analogous to a reputation system in a committee. Once ϑ is learned, it remains fixed during deployment; for a new query, the players generate responses, and the aggregator combines them by weighting each response according to the corresponding entries in ϑ. Importantly, no LLM parameters are fine‑tuned; only the external table is updated, making the approach lightweight and compatible with black‑box APIs.
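A sketch of how such a table might be maintained and used. The uncertainty penalty (dividing the score by one plus the mean CI half-width), the running-mean accumulation, and the softmax weighting are all assumed forms for illustration; the paper does not specify these exact formulas.

```python
import math

def update_weight_table(theta, counts, player, dim, grader_scores, grader_cis):
    """Accumulate one evaluation round into the confidence-weight table.

    theta, counts : dicts keyed by (player, task_dimension).
    grader_scores : rubric scores in [-100, 100], one per grader.
    grader_cis    : 95% CI half-widths reported by each grader.
    """
    score = sum(grader_scores) / len(grader_scores)
    uncertainty = sum(grader_cis) / len(grader_cis)
    adjusted = score / (1.0 + uncertainty)  # assumed uncertainty penalty
    key = (player, dim)
    counts[key] = counts.get(key, 0) + 1
    old = theta.get(key, 0.0)
    # Running mean over rounds: the table stores historical reliability.
    theta[key] = old + (adjusted - old) / counts[key]


def aggregation_weights(theta, players, dim):
    """Softmax over players' table entries for one task dimension."""
    vals = [theta.get((p, dim), 0.0) for p in players]
    m = max(vals)
    exps = [math.exp(v - m) for v in vals]
    total = sum(exps)
    return [e / total for e in exps]
```

Once deployed, `theta` stays fixed: only `aggregation_weights` is consulted when combining new candidate responses.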
To evaluate the framework, the authors create two novel benchmarks that target distinct aspects of scientific reasoning. ScienceEval mixes heterogeneous problems from geoscience, biology, mathematics, and physical science, testing cross‑domain knowledge integration. ScienceNarrative requires agents to produce structured scientific proposals under a long‑context template, assessing narrative coherence and logical consistency over extended text. Both benchmarks expose limitations of traditional single‑shot or single‑domain evaluations such as GSM8K or PubMedQA.
Across a suite of baseline agents (including ChatGPT, LLaMA, and domain‑specific models) and existing multi‑agent strategies (majority voting, debate, and recent ensemble methods), Roundtable Policy consistently outperforms the alternatives. Average gains are 13.01 % on ScienceEval and 11.04 % on ScienceNarrative. The authors also conduct extensive ablations: varying the number of player agents, grader agents, and the granularity of the confidence‑weight table; testing different initialization schemes; and measuring sensitivity to hyper‑parameters. Results show the method is robust: performance improvements persist across most configurations, and the confidence‑weight table converges quickly (within a few dozen rounds).
A notable investigation concerns grader bias. Individual graders can exhibit systematic preferences (e.g., favoring concise answers or particular phrasing). By aggregating multiple graders and incorporating the confidence interval into the weight calculation, the framework mitigates individual bias, leading to more stable and fair reliability estimates. The authors quantify inter‑grader consistency before and after aggregation, demonstrating a substantial reduction in variance.
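The variance-reduction effect of pooling graders can be demonstrated with a small simulation. The bias values, noise level, and Gaussian noise model are illustrative assumptions, not figures from the paper: each simulated grader scores with a fixed systematic offset plus independent noise, and averaging across graders shrinks the noise variance roughly by the number of graders.

```python
import random
import statistics

def grader_scores(true_quality, bias, noise_sd, rng, n=200):
    # One grader's repeated scores: systematic bias plus independent noise.
    return [true_quality + bias + rng.gauss(0, noise_sd) for _ in range(n)]

rng = random.Random(0)
true_q = 50.0
biases = [-12.0, 4.0, 8.0]   # illustrative systematic preferences
per_grader = [grader_scores(true_q, b, 15.0, rng) for b in biases]

# Aggregated score: mean across the three graders for each evaluation.
aggregated = [statistics.mean(s) for s in zip(*per_grader)]

indiv_var = statistics.mean(statistics.variance(s) for s in per_grader)
agg_var = statistics.variance(aggregated)
# Averaging independent graders cuts variance roughly by their count,
# and opposing biases partially cancel in the pooled estimate.
```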
The paper’s contributions are threefold: (1) Formalizing multi‑agent reasoning as a consensus‑formation problem with explicit, task‑aware reliability modeling; (2) Introducing the confidence‑weighted table as a structured, interpretable memory that guides aggregation without modifying underlying LLMs; (3) Providing two scientifically realistic benchmarks and showing that the approach yields significant accuracy and coherence gains over strong baselines.
Limitations are acknowledged. The current system relies on LLM‑based graders, so systematic errors in the graders can propagate into the confidence‑weight table. The table’s dimensionality grows with the number of task aspects, potentially increasing memory overhead for very fine‑grained evaluations. Future work is suggested in three directions: (a) integrating human expert graders to anchor the reliability estimates; (b) developing online or compressed versions of the confidence‑weight table for scalable deployment; and (c) extending the framework to non‑textual agents (e.g., code generators, simulation models).
In summary, Roundtable Policy offers a transparent, interpretable, and effective way to harness the complementary strengths of multiple LLMs for complex scientific reasoning. By quantifying and leveraging agent‑specific confidence, it moves beyond opaque ensemble methods and provides a practical blueprint for building trustworthy multi‑agent AI systems in research, engineering, and other domains where accuracy, consistency, and auditability are paramount.