Scaling Principles for Agent Systems: A Quantitative Analysis of Multi-Agent Collaboration and Model Capability

Reading time: 6 minutes
...

📝 Original Info

  • Title: Scaling Principles for Agent Systems: A Quantitative Analysis of Multi-Agent Collaboration and Model Capability
  • ArXiv ID: 2512.08296
  • Date: Pending
  • Authors: Yubin Kim¹,³,†, Ken Gu¹, Chanwoo Park³, Chunjong Park², Samuel Schmidgall², A. Ali Heydari¹, Yao Yan¹, Zhihan Zhang¹, Yuchen Zhuang², Yun Liu¹, Mark Malhotra¹, Paul Pu Liang³, Hae Won Park³, Yuzhe Yang¹, Xuhai Xu¹, Yilun Du¹, Shwetak Patel¹, Tim Althoff¹, Daniel McDuff¹, Xin Liu¹,† (¹Google Research, ²Google DeepMind, ³Massachusetts Institute of Technology; †Corresponding authors: ybkim95@mit.edu, xliucs@google.com)

📝 Abstract

Agents, language model (LM)-based systems capable of reasoning, planning, and acting, are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored, leaving practitioners to rely on heuristics rather than principled design choices. We address this gap by deriving quantitative scaling principles for agent systems. We first formalize a definition for agentic evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We evaluate this across four diverse benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench, spanning financial reasoning, web navigation, game planning, and workflow execution. Using five canonical agent architectures (Single-Agent System and four Multi-Agent Systems: Independent, Centralized, Decentralized, Hybrid), instantiated across three LLM families, we perform a controlled evaluation spanning 180 configurations, standardizing tools, prompt structures, and token budgets to isolate architectural effects from implementation confounds. We derive a predictive model using empirical coordination metrics, including efficiency, overhead, error amplification, and redundancy, that achieves a cross-validated R² = 0.524, enabling prediction on unseen task domains by modeling task properties rather than overfitting to a specific dataset. We identify three dominant effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; (2) capability saturation: coordination yields diminishing or negative returns (β̂ = −0.404, p < 0.001) once single-agent baselines exceed an empirical threshold of ∼45%; (3) topology-dependent error amplification: independent agents amplify errors 17.2× through unchecked propagation, while centralized coordination contains this to 4.4×. Crucially, coordination benefits are task-contingent. Centralized coordination improves performance by 80.8% on parallelizable tasks like financial reasoning, while decentralized coordination excels on dynamic web navigation (+9.2% vs. +0.2%). Yet for sequential reasoning tasks, every multi-agent variant we tested degraded performance by 39–70%. The framework predicts the optimal coordination strategy for 87% of held-out configurations. Out-of-sample validation on GPT-5.2, released after our study, achieves MAE = 0.071 and confirms that four of five scaling principles generalize to unseen frontier models, providing a quantitatively predictive framework for agentic scaling based on measurable task properties.
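The predictive model itself is not reproduced on this page. As a minimal sketch of the kind of fit the abstract describes, the following assumes the four coordination metrics it names (efficiency, overhead, error amplification, redundancy) plus the single-agent baseline score are available as per-configuration features; the feature construction, synthetic data, and coefficients here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a cross-validated linear scaling model of the
# kind the abstract describes. Feature names come from the abstract; the
# data and weights below are synthetic assumptions, not the paper's code.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 180  # the paper evaluates 180 configurations

# Hypothetical per-configuration features: coordination efficiency,
# coordination overhead, error amplification factor, redundancy,
# and the single-agent (SAS) baseline score.
X = np.column_stack([
    rng.uniform(0.0, 1.0, n),   # efficiency
    rng.uniform(0.0, 0.5, n),   # overhead
    rng.uniform(1.0, 20.0, n),  # error amplification factor
    rng.uniform(0.0, 1.0, n),   # redundancy
    rng.uniform(0.1, 0.9, n),   # SAS baseline score
])
# Synthetic target: MAS score, with a negative weight on the SAS baseline
# to mimic the capability-saturation effect (the paper reports β̂ = −0.404).
y = (0.3 * X[:, 0] - 0.2 * X[:, 1] - 0.01 * X[:, 2]
     - 0.404 * X[:, 4] + 0.5 + rng.normal(0, 0.05, n))

r2_cv = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2_cv.mean():.3f}")  # paper reports 0.524
```

With measured metrics from real configurations in place of the synthetic arrays, the same pipeline would yield a cross-validated fit of the kind reported, to the extent a linear model captures the interactions.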

💡 Deep Analysis

Figure 1 | Agent scaling across model intelligence and system architectures (full caption reproduced in the Full Content section below).

📄 Full Content

Towards a Science of Scaling Agent Systems

Yubin Kim¹,³,†, Ken Gu¹, Chanwoo Park³, Chunjong Park², Samuel Schmidgall², A. Ali Heydari¹, Yao Yan¹, Zhihan Zhang¹, Yuchen Zhuang², Yun Liu¹, Mark Malhotra¹, Paul Pu Liang³, Hae Won Park³, Yuzhe Yang¹, Xuhai Xu¹, Yilun Du¹, Shwetak Patel¹, Tim Althoff¹, Daniel McDuff¹ and Xin Liu¹,†

¹Google Research, ²Google DeepMind, ³Massachusetts Institute of Technology, †Corresponding authors: ybkim95@mit.edu, xliucs@google.com (arXiv:2512.08296v2 [cs.AI], 17 Dec 2025)

1. Introduction

Agents (Wang et al., 2024a), language model-driven systems that operate through iterative cycles of reasoning, planning, and acting, adapting their behavior based on environmental or tool-generated feedback, have achieved remarkable performance in diverse applications, from code generation (Yang et al., 2024; Zhang et al., 2024), web browsing (Wei et al., 2025; Yao et al., 2022), medical decision-making (Heydari et al., 2025; Kim et al., 2024; McDuff et al., 2025), finance (Yu et al., 2025), and sustainability (Zhang et al., 2025b) to scientific discovery (Gottweis et al., 2025; Mitchener et al., 2025). As tasks grow in complexity and require sustained environmental interaction, the field has increasingly turned to multi-agent systems (MAS), relying on the premise that specialized …

Figure 1 | Agent scaling across model intelligence and system architectures. Average performance (%) across four agentic benchmarks improves consistently with increasing model Intelligence Index (see Appendix A) across three major LLM families (OpenAI, Google, and Anthropic) under different agent configurations. The Single Agent System (SAS) serves as the reference trajectory, while Multi Agent System (MAS) variants (Centralized, Decentralized, Independent, and Hybrid) reveal distinct scaling behaviors (see Table 2 for architecture comparisons). All percentage deltas annotated in the figure (e.g., +8.7%, +8.1%, −4.6%) indicate the relative performance change of the best-performing MAS variant compared to the SAS baseline at the same Intelligence Index.
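To make the figure's percentage annotations concrete, here is a small sketch of how such deltas can be computed: for each model, take the best-scoring MAS variant and report its relative change over the SAS baseline at the same Intelligence Index. The scores below are placeholders chosen for illustration, not values from the paper.

```python
# Illustrative sketch: relative delta of the best MAS variant over the SAS
# baseline, as annotated in Figure 1. Scores are placeholders, not paper data.
def best_mas_delta(sas_score: float,
                   mas_scores: dict[str, float]) -> tuple[str, float]:
    """Return (variant, relative % change) of the best MAS variant vs. SAS."""
    variant, score = max(mas_scores.items(), key=lambda kv: kv[1])
    return variant, 100.0 * (score - sas_score) / sas_score

variant, delta = best_mas_delta(
    sas_score=0.46,
    mas_scores={"Centralized": 0.50, "Decentralized": 0.48,
                "Independent": 0.41, "Hybrid": 0.47},
)
print(f"best MAS variant: {variant}, delta vs SAS: {delta:+.1f}%")
```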

📸 Image Gallery

GDM_logo.png, GR_logo.png, agent-scaling_main_v5.png, architecture_comparison_combined.png, boxplots_v2.png, claude-logo.png, cost_performance.png, gemini-logo.png, heterogeneous.png, mas_centralized.png, mas_decentralized.png, mas_hybrid.png, mas_independent.png, openai-logo.png, sas.png

Reference

This content is AI-processed based on open access ArXiv data.
