AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent
While large language model (LLM) multi-agent systems achieve superior reasoning performance through iterative debate, practical deployment is limited by their high computational cost and error propagation. This paper proposes AgentArk, a novel framework that distills multi-agent dynamics into the weights of a single model, effectively transforming explicit test-time interactions into implicit model capabilities. This equips a single agent with the intelligence of a multi-agent system while remaining computationally efficient. Specifically, we investigate three hierarchical distillation strategies across various models, tasks, scales, and scenarios: reasoning-enhanced fine-tuning, trajectory-based augmentation, and process-aware distillation. By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of a single agent while exhibiting the strong reasoning and self-correction performance of multiple agents. They further demonstrate enhanced robustness and generalization across diverse reasoning tasks. We hope this work sheds light on future research on efficient and robust multi-agent development. Our code is at https://github.com/AIFrontierLab/AgentArk.
💡 Research Summary
AgentArk tackles the fundamental trade‑off in large language model (LLM) multi‑agent systems (MAS): while MAS achieve impressive reasoning performance through iterative debate, critique, and consensus, they incur prohibitive inference‑time computation and risk error amplification across agents. The authors propose a novel distillation framework that shifts the burden of multi‑agent reasoning from inference to training, enabling a single LLM to internalize the collective intelligence of a MAS.
The framework consists of three hierarchical distillation strategies. First, Reasoning‑Enhanced Supervised Fine‑Tuning (R‑SFT) trains the student model on both the final consensus answer and the full reasoning trajectory generated by the teacher MAS, optimizing a combined loss for intermediate rationales and the final answer. Second, Trajectory‑Based Data Augmentation (DA) extracts multiple “correct‑first” reasoning paths that lead to the ground‑truth answer but differ in logical heuristics, mathematical identities, or starting assumptions; these diverse paths are used as augmented supervision, encouraging the student to learn several valid solution routes. Third, Process‑Aware Distillation (P‑AD) introduces a Process Reward Model (PRM) that predicts step‑level correctness via a contrastive reward, and trains the student policy with Group Relative Policy Optimization (GRPO), a reinforcement‑learning algorithm that compares rewards within a sampled group of outputs, eliminating the need for a separate value function.
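The group-relative comparison at the heart of GRPO can be illustrated with a minimal sketch: rewards for a sampled group of outputs are normalized against the group's own mean and standard deviation, so no separate value function is needed. This is a simplified illustration of the general GRPO advantage computation, not the paper's exact implementation (the PRM's step-level contrastive rewards are abstracted into a single scalar per output):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled output's reward
    against the mean and std of its own group. Rewards here stand in for
    the PRM's scores; the paper's step-level reward shaping is omitted."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Outputs scoring above the group mean receive positive advantages and are reinforced; below-mean outputs are penalized, which is what lets GRPO dispense with a learned critic.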
Data generation uses a debate‑style MAS where n homogeneous agents interact for K rounds, each conditioning on peers’ previous traces. The pipeline filters for trajectories that eventually reach the correct answer, even if earlier steps contained mistakes, thereby emphasizing self‑correction behavior.
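The data-generation loop described above can be sketched as follows. The agent interface `(question, peer_traces) -> (trace, answer)` and the majority-vote consensus are illustrative assumptions, not the paper's actual API; the key behavior it mirrors is that a multi-round trajectory is kept only when the final consensus is correct, even if earlier rounds contained mistakes:

```python
from collections import Counter

def debate_trajectories(question, gold_answer, agents, rounds):
    """Debate-style pipeline sketch: n homogeneous agents each produce a
    (reasoning_trace, answer) pair per round, conditioned on peers' traces
    from the previous round. The trajectory is retained only if the final
    round's majority answer matches the gold answer, preserving examples
    of self-correction. Hypothetical interface for illustration."""
    history = []                              # per-round list of (trace, answer)
    prev = [("", None)] * len(agents)         # round 0: no peer context yet
    for _ in range(rounds):
        prev = [agent(question, [t for j, (t, _) in enumerate(prev) if j != i])
                for i, agent in enumerate(agents)]
        history.append(prev)
    final_answers = [ans for _, ans in history[-1]]
    consensus, _ = Counter(final_answers).most_common(1)[0]
    return history if consensus == gold_answer else None
```

Filtering on the final answer rather than per-step correctness is what lets trajectories with recovered errors into the training set, which is the self-correction signal the students are meant to internalize.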
Experiments span three major model families—Qwen‑3, Gemma‑3, and Llama‑3—distilling from large teachers (e.g., Qwen‑3‑32B, Gemma‑3‑27B‑it) to smaller students (e.g., Qwen‑3‑0.6B, Gemma‑7B, Llama‑3‑8B). Across eight reasoning benchmarks (math, medical, code, commonsense, etc.), the authors report: (1) R‑SFT alone yields 2–4 percentage‑point (pp) gains over a single‑agent baseline; (2) adding DA provides an extra 1–2 pp by exposing the student to diverse logical routes; (3) P‑AD delivers the largest boost, achieving 5–7 pp overall improvement and markedly better step‑wise error detection and self‑correction. PRM capacity proves more critical than student size: a high‑capacity PRM benefits even small students, while weak PRMs limit gains. Scaling the number of teacher agents benefits larger students but shows diminishing returns for tiny models.
Robustness tests (noise injection, prompt paraphrasing) and domain transfer experiments (e.g., training on math, evaluating on medical) demonstrate that distilled models retain higher accuracy and stability than non‑distilled baselines, confirming that internalizing reasoning dynamics improves generalization. The framework also extends to multimodal LLMs, showing similar benefits when reasoning over image‑text inputs.
Limitations include reliance on debate‑based MAS (so applicability to tool‑using or memory‑augmented agents remains untested), the need for step‑level correctness labels to train PRM, and the potential for teacher bias to propagate to the student. Training costs are front‑loaded, requiring large teacher models for data generation.
In summary, AgentArk provides a comprehensive, scalable approach to compress multi‑agent reasoning into a single LLM, achieving comparable accuracy, self‑correction, and robustness while drastically reducing inference latency. Future work may explore integrating other collaboration paradigms, developing label‑free PRM training, and mitigating teacher bias to further broaden the utility of single‑agent distilled intelligence.