Designing Cost-Effective Multi-Agent Systems under Budget Constraints and the AgentBalance Framework

February 23, 2026

Reading time: 6 minutes

...

📝 Abstract

Large Language Model (LLM)-based multi-agent systems (MAS) have become indispensable building blocks for web-scale applications (e.g., web search, social network analytics, online customer support), with cost-effectiveness becoming the primary constraint on large-scale deployment. While recent advances seek to improve MAS cost-effectiveness by shaping inter-agent communication topology and selecting agent backbones, they seldom model and optimize under explicit token-cost and latency budgets that reflect deployment constraints, leading to topology-first designs and suboptimal cost-effectiveness under budget constraints. In this paper, we present AgentBalance, a framework for constructing cost-effective MAS under explicit token-cost and latency budgets via a backbone-then-topology design. Specifically, we first propose a backbone-oriented agent generation module that constructs agents with heterogeneous backbones via LLM pool construction, pool selection, and agent role-backbone matching. Then, we propose an adaptive MAS topology generation module that guides inter-agent communication through agent-representation learning, gating, and latency-aware topology synthesis. Extensive experiments on benchmarks with 14 candidate LLM backbones show that AgentBalance delivers up to 10% and 22% performance gains under matched token-cost and latency budgets, respectively, and achieves strong AUCs across benchmarks on performance-budget curves. It also works as a plug-in for existing MAS, further improving performance under the same token-cost and latency constraints, and exhibits strong inductive ability on unseen LLMs for practical, budget-aware deployment. Our code can be found at https://github.com/usail-hkust/AgentBalance

CCS Concepts • Computing methodologies → Multi-agent systems; Natural language processing; Machine learning.

💡 Analysis


📄 Content

LLM-based multi-agent systems (MAS) have demonstrated strong performance in web-scale applications such as web search [6], social-network forecasting [24], and online analytics [33], by decomposing complex tasks into specialized roles [13], integrating tool use and web APIs [34], and enabling inter-agent collaboration [19,47]. However, such coordination structurally couples system performance to token-cost (via API token consumption and inference-time computation [12]) and end-to-end latency: deeper interaction chains trigger more LLM calls, lengthen inter-agent contexts, and add serialization overhead [12]. Equipping agents with advanced Large Reasoning Models (LRMs), e.g., OpenAI o3 and DeepSeek-R1 [11,30], as backbones further amplifies this coupling in MAS [3,21,51]. As shown in Figure 1, while accuracy improves, the reasoning-heavy decoding and longer outputs of LRMs increase completion length and inference time [17,27], driving token-cost and latency to levels that are hard to sustain in practice. These elevated costs render many configurations impractical for production web services, e.g., ride-hailing dispatch [26], contact centers [8], and real-time social agents [58], where token-cost and latency are governed by explicit budgets or operational constraints. Accordingly, optimizing for performance alone is misaligned with deployment realities. MAS should be designed and evaluated in a budget-aware manner, maximizing performance subject to token-cost and latency constraints to improve cost-effectiveness.
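The budget-aware goal stated above can be written as a constrained optimization problem. The formulation below is only an illustrative sketch: the symbols $\mathcal{G}$ (a MAS configuration, i.e., agents, backbones, and topology), $U$ (task performance), $C$ (token-cost), $T$ (end-to-end latency), $B_c$ and $B_t$ (the two budgets), and $\mathcal{D}$ (the query distribution) are notation assumed here and are not taken from the paper.

```latex
% Assumed notation, not the paper's: maximize expected performance
% subject to expected token-cost and latency budgets.
\begin{aligned}
\max_{\mathcal{G}} \quad & \mathbb{E}_{q \sim \mathcal{D}} \big[ U(\mathcal{G}; q) \big] \\
\text{s.t.} \quad & \mathbb{E}_{q \sim \mathcal{D}} \big[ C(\mathcal{G}; q) \big] \le B_c, \\
& \mathbb{E}_{q \sim \mathcal{D}} \big[ T(\mathcal{G}; q) \big] \le B_t .
\end{aligned}
```

Under this reading, optimizing for performance alone corresponds to dropping both constraints, which is the setting the text argues is misaligned with deployment.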

Figure 1: High performance in MAS often comes with token-cost and latency that exceed practical budgets. Stakeholders seek solutions that achieve competitive accuracy while respecting token-cost and latency budgets, motivating budget-aware, cost-effective MAS.

Despite growing interest in powerful MAS [7,15,38,57,59], existing studies inadequately address budget-constrained settings on both the objective and methodology fronts. On the objective side, many works prioritize performance, treating token-cost as secondary and rarely emphasizing latency. They also seldom evaluate under explicit token-cost and latency budgets, leaving the practical deployability of these methods under realistic constraints unclear. On the methodology side, prior work concentrates on inter-agent communication [44,54,55], reducing communication redundancy or removing redundant agents for cost-effectiveness while typically assuming a single strong backbone, thereby overlooking how backbone size, family, and type shape cost-effectiveness. Very recent work such as MasRouter incorporates multiple LLM backbones for MAS [52], but remains topology-first and does not account for two properties of backbone choice for agents: (i) backbone choice is a primary driver of movement along the frontier between cost and performance relative to topology modification (Figure 2, left), and (ii) backbone choice shapes the optimal communication topology (Figure 2, right). Taken together, these observations motivate a backbone-first strategy: fix the backbone to define the feasible performance region under given budgets, then optimize topology within it. We therefore adopt a backbone-then-topology approach to construct cost-effective MAS under budget constraints. However, this design is nontrivial and leads to two complementary challenges:

(1) Constructing cost-effective agents with heterogeneous backbones. Assigning backbones to agents given the context can improve cost-effectiveness (e.g., a mid-sized non-reasoning LLM often suffices for simple tasks such as tool invocation [4]). However, this is nontrivial in MAS: the set of candidate backbones is large and heterogeneous (sizes from billions to hundreds of billions; reasoning and non-reasoning), spanning a wide frontier between cost and performance. Appropriate assignment depends jointly on the query and the agent role, with mismatched assignments markedly degrading performance (Figure 2, middle). Thus, how to assign backbones precisely so that each agent’s capability matches task demands within budget constraints is the first challenge.
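To make the assignment problem in challenge (1) concrete, here is a minimal Python sketch that picks one backbone per role by brute-force enumeration under a token-cost budget. It is not the paper's role-backbone matching module; the backbone names, quality and cost estimates, and role weights are all hypothetical values chosen for illustration.

```python
from itertools import product

# Hypothetical per-backbone estimates: (expected quality, token-cost per call in USD).
BACKBONES = {
    "small": (0.60, 0.001),
    "mid": (0.75, 0.004),
    "reasoning": (0.90, 0.020),
}

# Hypothetical role weights: how much backbone quality matters for each role.
ROLE_WEIGHT = {"planner": 1.0, "tool_caller": 0.3, "verifier": 0.7}


def assign_backbones(roles, budget):
    """Return (score, assignment) maximizing weighted quality within the budget.

    The assignment is None when no combination fits the budget.
    """
    best_score, best_assignment = float("-inf"), None
    # Enumerate every way of giving each role one backbone (exact, but exponential).
    for combo in product(BACKBONES, repeat=len(roles)):
        cost = sum(BACKBONES[b][1] for b in combo)
        if cost > budget:
            continue  # violates the token-cost budget
        score = sum(ROLE_WEIGHT[r] * BACKBONES[b][0] for r, b in zip(roles, combo))
        if score > best_score:
            best_score, best_assignment = score, dict(zip(roles, combo))
    return best_score, best_assignment


roles = ["planner", "tool_caller", "verifier"]
score, assignment = assign_backbones(roles, budget=0.03)
print(assignment)
```

With these made-up numbers the budget admits only one reasoning backbone, and the search gives it to the highest-weight role. Enumeration grows exponentially with the number of roles, which hints at why a large, heterogeneous backbone pool makes the real problem nontrivial.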

(2) Designing communication topology for agents with heterogeneous backbones. Designing role-aware communication patterns yields joint gains in performance and token-cost [18,36,49]. However, constructing a latency-aware communication topology for agents with heterogeneous backbones is nontrivial: the optimal topology of a MAS depends on the agents’ backbones (Figure 2, right). In addition, backbone heterogeneity across roles and queries exacerbates this issue, making it substantially harder to estimate each agent’s impact in the system and to decide with whom to communicate under latency considerations. Therefore, how to estimate each agent’s marginal contribution and adaptively design a latency-aware communication topology within budget constraints is the second challenge.
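One way to see why the best topology depends on the agents' backbones is to estimate end-to-end latency as the longest path through the communication graph. The sketch below assumes the topology is a directed acyclic graph, that independent agents run in parallel, and that each agent has a fixed latency set by its backbone. These are simplifying assumptions for illustration, and the agent names and latencies are hypothetical; this is not the paper's latency-aware topology synthesis.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def end_to_end_latency(predecessors, latency):
    """Longest-path latency of a DAG.

    predecessors maps an agent to the set of agents whose output it waits for;
    latency maps every agent to its (assumed fixed) response time in seconds.
    """
    graph = {agent: set(predecessors.get(agent, ())) for agent in latency}
    finish = {}
    # static_order() yields each agent only after all of its predecessors.
    for agent in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[agent]), default=0.0)
        finish[agent] = start + latency[agent]
    return max(finish.values())


# Hypothetical latencies: the coder uses a slow reasoning backbone.
latency = {"planner": 2.0, "coder": 5.0, "searcher": 1.0, "verifier": 3.0}

# Coder and searcher run in parallel after the planner.
parallel = {"coder": {"planner"}, "searcher": {"planner"}, "verifier": {"coder", "searcher"}}
# The same agents arranged as one sequential chain.
chain = {"searcher": {"planner"}, "coder": {"searcher"}, "verifier": {"coder"}}

print(end_to_end_latency(parallel, latency))  # 10.0
print(end_to_end_latency(chain, latency))     # 11.0
```

In this toy setting, swapping the coder's backbone changes which path is critical, so an edge that is harmless with one backbone can dominate latency with another.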

To address the aforementioned challenges, we propose AgentBalance, a unified framework


This content is AI-processed based on ArXiv data.
