Self-Improvement of Language Models by Post-Training on Multi-Agent Debate


Self-improvement, where models improve beyond their current performance without external supervision, remains a challenge. The core difficulty is sourcing a training signal stronger than what the model itself can currently produce. Majority voting has been shown to provide such a signal by aggregating over multiple samples, helping mitigate some of the inconsistencies in LM reasoning. In this work, we show that multi-agent debate–where models collaborate and exchange reasoning over multiple rounds–provides an even richer signal than single-round majority voting. We introduce Multi-Agent Consensus Alignment (MACA), which uses reinforcement learning (RL) to post-train models to effectively utilize multi-agent debate. We find that preference learning over full reasoning traces, learning to differentiate between majority and minority reasoning, is more effective than binary consensus rewards or SFT-based approaches for leveraging these debate signals. This produces three key improvements: models are (1) better at utilizing the multi-agent debate setting (+26.87% on MATH), (2) individually more accurate (+21.51% on MathQA), and (3) more self-consistent (+27.6% on GSM8K). We also see strong generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA).


💡 Research Summary

The paper tackles the long‑standing problem of self‑improvement in language models—enhancing a model’s performance without relying on external supervision. The authors argue that the primary obstacle is obtaining a training signal that is stronger than what the model can already generate. While majority‑vote aggregation of multiple sampled reasoning paths provides a useful signal, it only mitigates inconsistencies at inference time and does not address the underlying instability of the model’s internal reasoning.

To overcome this, the authors introduce Multi‑Agent Consensus Alignment (MACA), a reinforcement‑learning framework that post‑trains a language model using signals derived from multi‑agent debate. In MACA, several identical copies of a base model engage in an iterative discussion: each agent first produces an answer and a reasoning trace, then all agents read each other’s traces and update their answers over a few rounds. After the final round, the majority answer is identified, and the corresponding reasoning traces are split into “consensus‑supporting” (G⁺) and “dissenting” (G⁻) groups. This creates a self‑generated dataset where G⁺ traces are treated as preferred and G⁻ as non‑preferred.
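The debate-and-split loop described above can be sketched in a few lines; this is a minimal illustration, not the paper's implementation, and it assumes two hypothetical helpers: `generate(prompt)`, which samples one reasoning trace from the model, and `extract_answer(trace)`, which parses the final answer out of a trace.

```python
from collections import Counter

def debate_and_split(question, generate, extract_answer, n_agents=3, n_rounds=2):
    """Run a multi-agent debate and split the final traces by majority answer.

    `generate` and `extract_answer` are hypothetical helpers (sample one
    reasoning trace; parse its final answer), not from the paper's code.
    Returns (G+, G-): traces agreeing / disagreeing with the majority.
    """
    # Round 1: each agent answers the question independently.
    traces = [generate(question) for _ in range(n_agents)]

    # Later rounds: each agent reads the other agents' traces and revises.
    for _ in range(n_rounds - 1):
        traces = [
            generate(question + "\n\nOther agents said:\n" +
                     "\n---\n".join(t for j, t in enumerate(traces) if j != i))
            for i, _ in enumerate(traces)
        ]

    # The majority answer after the final round defines the preference split.
    answers = [extract_answer(t) for t in traces]
    majority, _ = Counter(answers).most_common(1)[0]
    g_pos = [t for t, a in zip(traces, answers) if a == majority]  # G+
    g_neg = [t for t, a in zip(traces, answers) if a != majority]  # G-
    return g_pos, g_neg
```

Because the split is defined purely by post-debate agreement, this loop needs no ground-truth labels, which is what makes the resulting preference data self-generated.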

Four post‑training objectives are explored:

  1. MV‑SFT – supervised fine‑tuning that simply imitates the G⁺ traces.
  2. MV‑GRPO – a reinforcement‑learning objective that samples new traces, assigns a binary reward based on whether the trace’s final answer matches the majority, and uses group‑normalized advantages.
  3. MV‑DPO – a Direct Preference Optimization loss that forms explicit preference pairs (y⁺, y⁻) from G⁺ and G⁻ and maximizes the log‑probability ratio of the preferred over the non‑preferred trace.
  4. MV‑KTO – an unpaired objective in the style of Kahneman‑Tversky Optimization (KTO) that handles class imbalance by weighting the G⁺ and G⁻ terms separately.

The key insight is that preference learning over full reasoning traces (as in DPO and KTO) is far richer than binary consensus rewards. By contrasting entire argumentation sequences, the model learns not just which final answer is correct, but which patterns of reasoning tend to survive peer deliberation. This “peer‑grounded” signal requires no external ground truth and is robust even when the underlying problem is ambiguous.
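For contrast, the binary consensus signal used by MV‑GRPO reduces each sampled trace to a single scalar before normalizing within the group. A minimal sketch of that advantage computation, assuming final answers have already been extracted from the traces:

```python
def grpo_advantages(answers, majority, eps=1e-8):
    """Group-normalized advantages for one group of sampled traces.

    Reward is 1 if a trace's final answer matches the majority answer,
    else 0; advantages are the rewards standardized within the group.
    Illustrative sketch, not the paper's implementation.
    """
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Note how all information about *how* a trace argued is discarded here: two traces reaching the majority answer by very different reasoning receive identical advantages, which is exactly the coarseness the preference-based objectives avoid.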

Experiments are conducted on four instruction‑tuned small models (Qwen‑2B, Llama‑3B, Phi‑4B, Llama‑8B) using 4‑bit quantization (QLoRA) on a single‑node cluster. The debate configuration uses three agents (M = 3) and two deliberation rounds (R = 2) with a temperature τ = 1.0 to encourage diverse sampling. Six reasoning benchmarks are evaluated: MATH, GSM8K, MathQA, SVAMP, GPQA, and CommonsenseQA. For each benchmark, a 1500‑example training split and a 500‑example test split are used, and results are averaged over three random seeds.
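The setup above can be summarized as a small configuration fragment; the field names below are illustrative, not taken from the paper's released code, and only the values restate what the text describes.

```python
# Hypothetical config mirroring the experimental setup described above.
DEBATE_CONFIG = {
    "n_agents": 3,       # M = 3 identical copies of the base model
    "n_rounds": 2,       # R = 2 deliberation rounds
    "temperature": 1.0,  # tau = 1.0 to encourage diverse sampling
}

TRAIN_CONFIG = {
    "quantization": "4-bit (QLoRA)",
    "train_examples_per_benchmark": 1500,
    "test_examples_per_benchmark": 500,
    "seeds": 3,  # results averaged over three random seeds
}
```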

Key findings include:

  • Improved debate performance: Across all models, MACA raises both the average accuracy of agents and the proportion of agents that agree with the majority answer after the final round. For example, Llama‑3B on MATH improves from an initial average of 35.33 % to a final average of 52.93 % (+17.6 % points).
  • Higher individual accuracy: When the post‑trained model is used in a single‑agent setting, accuracy jumps dramatically. Llama‑8B on MathQA rises from 44.60 % (baseline) to 69.27 % after MACA‑DPO (+24.67 % points). Similar gains are observed on other datasets, with the best improvements consistently coming from the DPO and KTO variants.
  • Enhanced self‑consistency: On GSM8K, the fraction of sampled reasoning paths that converge on the same answer (self‑consistency) increases by 27.6 % points, indicating that the model’s internal reasoning becomes more stable across stochastic sampling.
  • Generalization to unseen domains: Even on benchmarks not used during post‑training, MACA yields sizable lifts: GPQA (+16.3 % points) and CommonsenseQA (+11.6 % points), demonstrating that the learned consensus‑driven reasoning transfers beyond the training distribution.
  • Ablation of objectives: Preference‑based objectives (DPO, KTO) outperform both pure imitation (MV‑SFT) and reward‑only RL (MV‑GRPO), confirming that learning to differentiate between consensus‑supporting and dissenting traces is crucial.
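The self-consistency metric in the third bullet can be computed as the fraction of independently sampled answers that agree with the modal answer. A minimal sketch, assuming answers have already been parsed out of the sampled traces:

```python
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled answers matching the most common answer.

    1.0 means every stochastic sample converged on the same final answer;
    values near 1/len(answers) indicate highly unstable reasoning.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    (_, top_count), = Counter(answers).most_common(1)
    return top_count / len(answers)
```

For example, four samples answering 7, 7, 7, 5 give a self-consistency of 0.75.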

The authors also discuss limitations. The current setup only uses identical model copies; mixing heterogeneous models could further enrich debate dynamics but is left for future work. Computational cost scales with the number of agents and debate rounds, so more efficient sampling or distillation strategies will be needed for larger models. Finally, the optimal number of agents and rounds is not systematically explored.

In summary, MACA introduces a novel self‑supervised loop: multi‑agent debate generates a richer training signal than simple majority voting; reinforcement learning with preference over full reasoning traces internalizes this signal; the resulting model becomes better at both collaborative debate and solitary reasoning, while also becoming more self‑consistent. This work represents a significant step toward truly self‑improving language models that can refine their own reasoning without external annotation.

