IEMAS: An Incentive-Efficiency Routing Framework for Open Agentic Web Ecosystems



Hongze Liu*1, Chang Guo*1, Yingzeng Li1, Mengru Wang1, Jiong Lou1, Shijing Yuan1, Hefeng Zhou1, Chentao Wu1, Jie Li1

1 School of Computer Science, Shanghai Jiao Tong University, China. Correspondence to: Jie Li <lijiecs@sjtu.edu.cn>.

Preprint. March 19, 2026.

Abstract

The transition to open, distributed Multi-Agent Systems (MAS) promises scalable intelligence but introduces a non-trivial tension: maximizing global efficiency requires cooperative, resource-aware scheduling, yet autonomous agents may be self-interested and cannot be managed by a centralized controller. Prior approaches fall short in two key areas: they typically focus on single-query routing, neglecting long-term resource reuse (e.g., KV-caching) and the complexities of system-level many-to-many matching; furthermore, they rely on generic incentive mechanisms that ignore the distinct characteristics of LLM inference. To bridge this gap, we propose IEMAS (Incentive-Efficiency Mechanism for Multi-Agent Systems), a framework that aligns economic incentives with system performance. IEMAS integrates a probabilistic predictive model to estimate Quality of Service (QoS) under uncertainty, which feeds into a VCG-based bipartite matching mechanism. This design guarantees truthful capability reporting and social optimality while explicitly leveraging KV-cache affinity to minimize computational redundancy. We implement IEMAS on top of vLLM and evaluate it via extensive simulations. Results demonstrate that our incentive-efficiency co-design reduces average service cost by 35% and end-to-end latency by up to 2.9x compared to baselines.

1. Introduction

Large Language Models have enabled autonomous agents with strong reasoning and planning abilities, motivating recent work on multi-agent LLM systems where complementary agents collaborate to solve complex tasks (Guo et al., 2024). In practical deployments, such agents are often hosted by heterogeneous entities and distributed across networked infrastructures (Guo et al., 2025). Emerging paradigms, such as BetaWeb (Guo et al., 2025), Agentic Web (Yang et al., 2025c), DAWN (Aminiranjbar et al., 2025), and the Internet of Agents (IoA) (Chen et al.; Wang et al., 2025), envision open, decentralized ecosystems where agents dynamically coordinate over the network to serve diverse client requests via flexible workflows.

However, coordinating LLM agents in open, distributed ecosystems creates three interdependent challenges (Wang et al., 2025). P1. Scheduling and routing: matching concurrent client requests to heterogeneous, capacity-limited agents to maximize efficiency while preserving KV-prefix locality and meeting service constraints. P2. Incentives: independent providers with private costs may misreport capabilities or selectively serve requests unless economically motivated. P3. Communication scalability: broadcasting agent details or context histories between all participants incurs prohibitive latency and cost, rendering full transparency infeasible.

Most prior work on request routing (P1) relies on centralized controllers or trusted environments, assumptions that break down at web scale (Kwon et al., 2023). While some studies address planning for specific multi-agent workflows (Yang et al., 2025b), they typically assume a one-to-many dispatch structure. In contrast, open networks exhibit a many-to-many structure where multiple heterogeneous requests must be jointly matched to distributed agents. Consequently, routing strategies optimal for single-query dispatch become suboptimal in multi-agent scenarios.
Crucially, while techniques like key–value (KV) caching yield order-of-magnitude speedups for related contexts, naïve load balancing destroys cache locality, significantly increasing latency and computational cost (Kwon et al., 2023; Microsoft, 2023).

Beyond scheduling inefficiencies, the absence of verifiable mechanisms (P2) introduces strategic risks: agents may misreport capabilities or behave opportunistically. Without incentive alignment, even carefully engineered schedulers fail. While mechanism design in distributed systems has shown promise in steering strategic actors, existing proposals using game theory (Wang et al., 2024), auctions (You et al., 2024), contract theory (Ye et al., 2024), and persuasion (Zhong et al., 2025) generally treat agents as homogeneous resources. They fail to incorporate constraints intrinsic to LLM services, such as unpredictable generation lengths (Sun et al., 2024; Chen et al., 2025a), skill specialization (Yang et al., 2025b), and the critical value of KV-cache reuse (Pan et al.). Furthermore, time-to-first-token (TTFT) effects directly shape perceived latency (Chen et al., 2025a). Ignoring these factors compromises the incentive structures required to elicit truthful, high-quality participation.

To design an organic LLM-MAS, mechanisms must explicitly incorporate these specific factors. Such integration enables allocation rules that preserve cache locality and balance short-term throughput with long-term cooperation. While auction theory provides a foundation for incentive design under asymmetric information (Qiu et al., 2022), adapting it to LLM agent inference is non-trivial due to KV-cache dependencies and stochastic performance (Sun et al., 2024; Chen et al., 2025a).
Moreover, in open environments, agent information (P3) is often incomplete, limiting the applicability of centralized optimization and motivating the need for distributed, incentive-aware mechanisms.

We address these gaps by proposing IEMAS (Incentive-Efficiency Mechanism for Multi-Agent Systems), a distributed routing framework that matches client requests to suitable agents while jointly aligning economic incentives and exploiting computational resources such as KV caching. IEMAS introduces a lightweight proxy layer, where each serving node deploys a gateway for query management and cache coordination, reducing communication overhead and enabling enforceable pricing without revealing proprietary internals. By leveraging KV-prefix caching, multi-turn queries reuse prior computation, while a cache-aware predictive model captures quality-of-service (QoS) signals, including performance, latency, partial cache states, and costs. Based on these signals, clients and agents participate in proxy-level auctions, and a Min-Cost Max-Flow (MCMF) mechanism determines the final request-to-agent routing by jointly considering bidding outcomes, capability alignment, and dynamic workload conditions. We implement IEMAS on vLLM, demonstrating stable agent utilities and improved long-term social welfare.

To the best of our knowledge, this work is the first to investigate an incentive–efficiency co-design routing framework in agentic web ecosystems. The main contributions are summarized as follows:

• We formalize the client–agent interaction model for distributed LLM agent services with agent specialization.
• We develop a cache-aware predictive scheme that estimates capability, expected latency, and cost metrics.
• We design an incentive-compatible auction mechanism that promotes truthful reporting and social welfare.
• We implement the proposed framework on vLLM and perform extensive experiments.
The related code is available at: https://github.com/PACHAKUTlQ/IMMAS

2. Related Work

Our work bridges distributed multi-agent systems, efficient LLM serving, and algorithmic game theory. We review these domains and position IEMAS within the emerging landscape of the open Agentic Web.

LLM-MAS and the Agentic Web: The paradigm of LLM-based Multi-Agent Systems (LLM-MAS) is shifting from closed, monolithic frameworks to open, decentralized networks. Early systems like AutoGEN (Wu et al., 2023), MetaGPT (Hong et al., 2024), and CAMEL (Li et al., 2023) demonstrated collaborative problem-solving but assumed centralized control and cooperative agents. Recent research envisions an Agentic Web or Internet of Agents (IoA), where heterogeneous agents dynamically discover and coordinate across the internet. Protocols such as IoA (Chen et al., 2025b) and OpenAgents (Xie et al., 2023) establish the connectivity layer for this vision, while infrastructures like BetaWeb (Guo et al., 2025) and DAWN (Aminiranjbar et al., 2025) focus on trust and decentralization. However, while these frameworks solve connectivity, they largely overlook the critical economic and system-level inefficiencies of distributing heavy inference workloads. Unlike IEMAS, they lack mechanisms to optimize the high data-movement costs of context transfer or prevent free-riding in trustless environments.

LLM Routing and Cache-Aware Scheduling: Efficient routing is paramount for memory-bound Transformer inference. Optimizations like vLLM (Kwon et al., 2023) and Orca (Yu et al., 2022) maximize throughput via non-contiguous memory management, while SGLang (Zheng et al., 2023) and Prompt Cache (Gim et al., 2024) utilize RadixAttention to achieve order-of-magnitude latency reductions via KV-prefix reuse. However, existing routing strategies (e.g., S-LoRA (Sheng et al., 2023), Splitwise (Patel et al.
, 2024)) are designed for centralized clusters with full state visibility. In open environments, naïve load balancing scatters semantically related queries, destroying cache locality. IEMAS addresses this gap by introducing incentive-compatible cache-aware routing, ensuring requests are routed based on the economic valuation of cached states rather than simple availability.

Figure 1. The Illustration of Agentic Web Routing.

Mechanism Design for LLM Ecosystems: As agents become autonomous economic actors, mechanism design is required to ensure truthful reporting. While auction theory is well-established for cloud resource allocation, standard models do not account for the stochastic, "stateful" nature (context cache) of LLM inference. Recent works on AI marketplaces (Bansal et al., 2025; Yang et al., 2025d) and truthful capability revelation (Liu et al., 2025) address strategic risks but often decouple the mechanism from system performance, treating compute as a generic commodity. IEMAS bridges this gap by embedding cache-affinity scores directly into a VCG-based auction (Vickrey, 1961; Clarke, 1971).
This ensures the market selects agents that are not only capable but also architecturally positioned (via active KV-caches) to execute tasks efficiently.

3. System Formulation

We consider a distributed LLM serving system consisting of a set of clients and a set of autonomous LLM agents. Let C = {1, ..., N} denote the set of clients (prior agents from workflows) and S = {1, ..., M} denote the set of serving agents. At a time slot, every client j ∈ C concurrently submits a task characterized by a semantic context T_j. Each agent i ∈ S is described by a model profile (S_i, K_i), where S_i denotes the model scale (e.g., parameter size or compute footprint) and K_i captures its domain specialization. Agent i has a finite service capacity B_i, representing the maximum number of concurrent tasks it can process, and maintains a local key–value (KV) cache that stores prefixes from previously processed contexts.

To capture the benefit of context reuse, we model the semantic overlap between task j and the cached state of agent i by a score o_ij ∈ [0, 1]. A higher o_ij indicates greater KV reuse, which can significantly reduce inference latency and effective computation cost. Accordingly, the service cost incurred by agent i when executing task j is modeled as C_i = C_i(T_j, S_i, o_ij), which decreases as cache overlap o_ij increases.

Each client derives utility from successful task execution, which depends on both output quality and latency. Let P_j(T_j, S_i, K_i) denote the expected performance or quality of serving task j using agent i, and let L_j(T_j, S_i, o_ij) denote the corresponding latency or TTFT. We model the client's valuation as a weighted combination of these factors:

v_j = δ P_j(T_j, S_i, K_i) − (1 − δ) L_j(T_j, S_i, o_ij),   (1)

where δ ∈ [0, 1] captures the client's relative preference between output quality and latency.
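As a minimal sketch of Eq. (1), the valuation below combines a predicted quality score and latency; the latency normalization constant is an assumption added here so the two terms share a unit-free scale, not a detail from the paper.

```python
# Illustrative sketch (not the authors' code) of the client valuation in
# Eq. (1): v_j = delta * P - (1 - delta) * L. Rescaling latency into a
# roughly [0, 1] range is an assumed normalization for comparability.
def client_valuation(quality: float, latency_ms: float, delta: float,
                     latency_scale_ms: float = 1000.0) -> float:
    latency_norm = latency_ms / latency_scale_ms
    return delta * quality - (1 - delta) * latency_norm

# A latency-sensitive client (small delta) prefers a fast, slightly less
# accurate agent over a slow, more accurate one.
v_fast = client_valuation(quality=0.80, latency_ms=200.0, delta=0.3)
v_slow = client_valuation(quality=0.90, latency_ms=2000.0, delta=0.3)
```

Varying δ toward 1 flips this preference, which is exactly the trade-off the auction must price.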
Task allocation is formulated as a many-to-many matching problem and implemented via a double-auction mechanism. Let x_ij ∈ {0, 1} indicate whether task j is assigned to agent i. Feasible allocations must satisfy

Σ_{i∈S} x_ij ≤ 1, ∀j ∈ C;   Σ_{j∈C} x_ij ≤ B_i, ∀i ∈ S.   (2)

Each client has a private valuation v_j, while each agent has a private service cost or ask price c_i. Given reported bids and asks, the auctioneer determines both the allocation {x_ij} and the corresponding payments. Ignoring strategic behavior, an efficient allocation maximizes total social welfare:

W = Σ_{i∈S} Σ_{j∈C} (v_j − C_i) x_ij,   (3)

subject to the capacity and assignment constraints above. In each periodic auction round, clients and agents submit bids and asks, and the auctioneer applies the matching and pricing algorithm described in §4.2.

A key challenge in this setting is that, unlike traditional distributed computing systems, service cost, latency, and performance for LLM-based agents are not deterministic functions of task and resource parameters (Sun et al., 2024; Chen et al., 2025a). Instead, these quantities exhibit substantial variability due to factors such as prompt structure, output length, and cache state. Consequently, prior to running the auction, the platform maintains a predictive model

M : (T_j, S_i, K_i, o_ij) ↦ (C_i, P_j, L_j),

which estimates the distributions or expectations of cost, performance, and latency. Recent work on LLM routing and scheduling suggests that although these metrics are stochastic, they often follow stable distributions conditioned on (T_j, S_i, K_i, o_ij) (Chen et al., 2025a). The auction mechanism then operates on these predicted quantities and is designed to satisfy feasibility, as well as additional economic desiderata such as incentive compatibility and individual rationality.

4. IEMAS

IEMAS is an integrated architecture designed to align economic incentives with system-level efficiency in the open Agentic Web. Unlike prior work on request routing or agent selection that primarily considers scheduling a single task in isolation, large-scale LLM serving in open networks naturally induces a many-to-many matching problem between concurrent client requests and heterogeneous agents. In such settings, myopic per-request routing decisions can be socially suboptimal, as they fail to account for capacity constraints, cross-request interactions, and cache locality effects, as empirically demonstrated in §5. Following prior work, we introduce a proxy hub as a trusted mediator (Wang et al., 2025; Yang et al., 2025d), which aggregates information across requests and agents and computes a joint allocation to achieve socially efficient scheduling.

Figure 2. IEMAS Overview. (a) Coarse-Grained Clustering: Incoming web queries are first allocated to specific Agent Hubs via a fast, domain-based clustering mechanism. (b) Predictive Auction: A proxy layer utilizes predictive modeling to generate uncertainty-aware bids/asks and executes an auction to match tasks to agents under capacity constraints. (c) Optimization: The allocation is solved as a Min-Cost Max-Flow (MCMF) problem to maximize social welfare based on truthful bidding.

The framework is organized as follows. §4.1 introduces the cache-aware predictive layer that preserves KV-cache locality and derives calibrated QoS signals. §4.2 presents a VCG-based mechanism that formulates task–agent assignment as a Min-Cost Max-Flow (MCMF) problem to maximize social welfare. §4.3 analyzes key theoretical properties, including truthfulness, budget balance, and exactness. Finally, §4.4 describes the Proxy Hub architecture, which improves scalability and enforces economic constraints by clustering agents and localizing matching. Figure 2 and Algorithm 1 provide an overview of the complete workflow.

4.1. Resource-Aware Predictive Modeling

IEMAS maintains a predictive model that estimates the key metrics (i.e., latency, cost, and quality) for every candidate pair (i, j) to drive the auction described in §4.2. The module operates in an online loop: (1) calculate cache locality, (2) predict QoS, (3) update via bandit feedback.

Prefix-Locality (KV Reuse Proxy): To estimate the cache affinity o_ij, the proxy maintains a prefix ledger for each agent i. This ledger stores the text p̄_{i,d} of the last executed prompt for a specific dialogue session d. Given a new request j belonging to session d(j) with prompt p_j, the proxy computes the Longest Common Prefix (LCP) length l_ij = lcp(p_j, p̄_{i,d(j)}).
The affinity score is defined as:

o_ij = l_ij / max(1, |p_j|) ∈ [0, 1].   (4)

Because the ledger is agent-specific, switching agents for a multi-turn conversation results in o_ij ≈ 0, correctly capturing the loss of locality.

QoS Online Prediction: IEMAS maintains an independent predictor g_i for each agent i. The latency and cost predictors use Hoeffding Tree regression, and the performance predictor uses a Hoeffding Tree classifier. At decision time, we capture system-load features: global router inflight/rate (I^(r), R^(r)) and agent-specific inflight/rate (I_i, R_i). We compute the normalized utilization u_i = I_i / max(1, B_i). For each pair (i, j), we construct a feature vector:

x_ij = (|p_j|, t_j, ω_ij, I^(r), R^(r), I_i, R_i, B_i, u_i, ξ_j),   (5)

where t_j is the turn index and ξ_j denotes metadata (e.g., domain tag). The Hoeffding Tree predictor outputs the estimates (L̂_ij, Ĉ_ij, Q̂_ij) = g_i(x_ij).

To reduce cold-start bias, IEMAS optionally performs a brief startup warm-up by issuing a small number of representative multi-turn dialogues to each agent to seed the predictors and establish initial cache state; latency labels during warm-up can be kept conservative to avoid one-time initialization artifacts.

Feedback Accounting: After execution, the proxy records the observed latency L^obs_ij and computes the realized cost based on token usage. Let π^miss_i and π^hit_i be the prices for uncached (miss) and cached (hit) prompt tokens, respectively. The observed cost is (Bergemann et al., 2025):

C^obs_ij = π^miss_i (n^prompt_j − n^hit_ij) + π^hit_i n^hit_ij + π^out_i n^gen_j.   (6)

The predictor is updated online with (L^obs_ij, C^obs_ij, P^obs_ij), enabling the system to learn the agent's true performance characteristics and cache behavior over time.
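The ledger bookkeeping above can be sketched in a few lines. This is an illustration under simplifying assumptions (character-level LCP rather than token-level, flat per-token prices), not the IEMAS implementation:

```python
# Sketch of the Section 4.1 proxy bookkeeping (assumed simplifications:
# character-level LCP, flat per-token prices).
def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def affinity(prompt: str, ledger_prompt: str) -> float:
    """Cache-affinity score of Eq. (4): o_ij = lcp / max(1, |p_j|)."""
    return lcp_len(prompt, ledger_prompt) / max(1, len(prompt))

def observed_cost(n_prompt: int, n_hit: int, n_gen: int,
                  pi_miss: float, pi_hit: float, pi_out: float) -> float:
    """Realized cost of Eq. (6): uncached prompt tokens billed at the miss
    price, cached ones at the hit price, generated tokens at the output
    price."""
    return pi_miss * (n_prompt - n_hit) + pi_hit * n_hit + pi_out * n_gen
```

A follow-up turn that extends the prompt last seen by agent i scores high affinity; routing the same turn to a fresh agent (empty ledger) scores zero, mirroring the loss of locality described above.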
It is important to note that prior research has established robust predictive schemes for estimating LLM performance, latency, and cost (Sun et al., 2024; Chen et al., 2025a; Feng et al., 2025), as well as advanced KV orchestration techniques for fine-grained utilization, such as CacheBlend (Yao et al., 2025). Our work distinguishes itself by focusing on the mechanism-design layer rather than low-level estimation or memory virtualization. IEMAS treats these system-level metrics as inputs to guide incentive-compatible auction valuations. Consequently, our framework is modular: state-of-the-art predictive models or novel KV management techniques can be seamlessly coupled into the IEMAS proxy to further enhance auction precision and resource efficiency.

4.2. Multi-Agent Matching Mechanisms

Efficient allocation relies on clients truthfully revealing their utilities to avoid market failure. Since agent service costs are transparently quantified by the proxy via Eq. 6, we assume agent costs are honest and restrict our incentive analysis to the strategic behavior of clients.

To elicit truthful participation, IEMAS adopts a welfare-maximizing allocation rule with VCG-style payments. Given a set of client requests T_j and a set of agents S, the proxy hub constructs candidate request–agent pairs (i, j) and derives scalarized valuations v̂_ij and costs ĉ_ij from the predictive models described in §4.1. The net welfare contribution of assigning request j to agent i is denoted by w_ij = v̂_ij − ĉ_ij, and pairs with w_ij < 0 are discarded to avoid inefficient matches. The proxy hub then solves the following welfare-maximization problem:

max_x W(C) = Σ_{j∈C} Σ_{i∈S} w_ij x_ij   (7)
s.t. Σ_{i∈S} x_ij ≤ 1, ∀j ∈ C;   Σ_{j∈C} x_ij ≤ q_i, ∀i ∈ S;   x_ij ∈ {0, 1},

where q_i denotes the maximum number of concurrent requests agent i can serve.
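On a toy instance, the optimum of Eq. (7) can be checked by exhaustive search, together with the Clarke-pivot payment that IEMAS charges matched clients, p_j = W(C \ {j}) − (W(C) − w_ij) + c_ij (Eq. (8)). The sketch below is illustrative (made-up numbers, brute force instead of the paper's min-cost max-flow reduction):

```python
# Toy check of the welfare maximization in Eq. (7) plus Clarke-pivot
# payments (Eq. (8)). Brute force; IEMAS solves this via MCMF at scale.
from itertools import product

def max_welfare(w, capacity, exclude=None):
    """Exact optimum of Eq. (7) by enumeration: each client j picks an
    agent i or stays unmatched; agent loads must respect q_i. Pairs with
    w_ij < 0 are simply never beneficial to select."""
    clients = [j for j in range(len(w)) if j != exclude]
    agents = range(len(capacity))
    best, best_assign = 0.0, {}
    for choice in product([None, *agents], repeat=len(clients)):
        load = {i: 0 for i in agents}
        welfare, assign = 0.0, {}
        for j, i in zip(clients, choice):
            if i is None:
                continue
            load[i] += 1
            welfare += w[j][i]
            assign[j] = i
        if all(load[i] <= capacity[i] for i in agents) and welfare > best:
            best, best_assign = welfare, assign
    return best, best_assign

# w[j][i]: predicted net welfare w_ij = v_ij - c_ij (illustrative values).
w = [[4.0, 1.0],
     [3.0, 2.5],
     [0.5, 2.0]]
capacity = [1, 2]                       # q_i: concurrent-request limits
W_all, assign = max_welfare(w, capacity)

def vcg_payment(j, i, c_ij):
    """Clarke pivot of Eq. (8) for client j matched to agent i, given the
    predicted agent cost c_ij."""
    W_minus_j, _ = max_welfare(w, capacity, exclude=j)
    return W_minus_j - (W_all - w[j][i]) + c_ij
```

Here agent 0 (capacity 1) serves client 0, and both remaining clients share agent 1; removing client 0 lets client 1 take agent 0, so client 0's payment internalizes exactly the externality it imposes on client 1.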
Problem (7) is a maximum-weight bipartite b-matching problem and admits a polynomial-time solution via a min-cost max-flow (MCMF) reduction. Specifically, as shown in Figure 2(c), we construct a flow network with a source connected to each task node with unit capacity, task-to-agent edges of unit capacity and cost −w_ij, and agent-to-sink edges with capacity q_i. Edges with positive cost (i.e., with w_ij < 0) are discarded to avoid inefficient matches. Any integral min-cost max-flow in this network corresponds to a feasible matching that minimizes total cost, and hence maximizes total welfare.

Given the welfare-maximizing allocation, IEMAS implements VCG (Clarke pivot) payments to ensure incentive compatibility. Let W(C) denote the optimal welfare of (7) and W(C \ {j}) the optimal welfare when request j is removed. If client request j is matched to agent i in the optimal allocation, its VCG payment is defined as

p_j = W(C \ {j}) − (W(C) − w_ij) + c_ij,   (8)

which equals the externality that request j imposes on all other participants. This mechanism is ex-post efficient and dominant-strategy incentive compatible for clients. Symmetric payments or rebates can be defined for agents when needed.

While VCG mechanisms do not, in general, guarantee budget balance (the total payments collected from clients may differ from the total compensation paid to agents), they provide a principled benchmark for truthful, welfare-optimal allocation. Moreover, although computing VCG payments naïvely requires resolving (7) once per request, in practice this overhead can be significantly reduced using warm-started min-cost flow solvers and incremental re-optimization within each proxy hub.

4.3. Algorithmic Properties

In this section, we provide a formal analysis of the economic and algorithmic properties of IEMAS.
Specifically, we demonstrate that the coupling of the Min-Cost Max-Flow (MCMF) algorithm with the VCG payment rule is theoretically sound. We prove that the flow-based allocation is exact (a strict prerequisite for VCG truthfulness), thereby resolving the potential conflict between computational tractability and incentive compatibility.

Exactness of Allocation via MCMF: The validity of VCG relies strictly on the allocation rule being allocatively efficient (i.e., finding the global maximum of the welfare function). If the allocation algorithm were an approximation, the dominant-strategy equilibrium of VCG would collapse. We show that our flow-based formulation is exact. Recall the welfare maximization problem defined in Eq. 7. Let G = (V, E) be the constructed bipartite flow network. We map the welfare weights w_ij to edge costs such that cost_ij = −w_ij.

Theorem 4.1 (Allocative Efficiency). The assignment x* produced by the Min-Cost Max-Flow (MCMF) algorithm in the IEMAS flow network maximizes the total social welfare W(C) subject to capacity constraints.

Proof. See Appendix A.1.

Incentive Compatibility: Having established that MCMF yields the optimal allocation x*, we now prove that the mechanism elicits truthful reporting of the client valuation v_i.

Theorem 4.2 (Dominant-Strategy Incentive Compatibility for Clients). Assuming truthful agents, reporting the true valuation v_i is a dominant strategy for every client i ∈ C under the IEMAS mechanism.

Proof. See Appendix A.2.

Budget Balance: While standard VCG mechanisms are not generally budget-balanced, the specific asymmetry of our market design, in which agent costs are verifiable via the proxy and clients are the primary strategic actors, allows us to guarantee that the system never runs a financial deficit.

Theorem 4.3 (Weak Budget Balance).
The IEMAS mechanism satisfies ex-post weak budget balance. The total payment collected from clients is sufficient to cover the total service costs incurred by agents:

Σ_{j ∈ C_matched} p_j ≥ Σ_{i ∈ S_active} c_i.   (9)

Proof. See Appendix A.3.

Computational Consistency: While VCG theoretically requires resolving the optimization problem |C| + 1 times (once per matched client), the coupling with MCMF mitigates this overhead. The computation of W(C \ {j}) is equivalent to finding a minimum-cost flow adjustment on the residual graph G_f obtained after the initial allocation x* (Ahuja et al., 1993). By reusing the dual potentials (conceptually similar to Johnson's re-weighting algorithm) from the primary solution, the marginal cost of computing payments is significantly lower than solving from scratch (Hershberger & Suri, 2001). Furthermore, in bipartite assignment settings, it has been established that VCG payments can often be derived directly from the optimal dual variables of the linear relaxation, rendering the re-optimization step efficient or even instantaneous (Leonard, 1983). This ensures the mechanism remains practical for real-time routing.

4.4. Agentic Hub Architecture

The system model described above faces two fundamental challenges when deployed in large-scale, open LLM agent networks. First, realistic deployments may involve hundreds or thousands of independently operated agents and a high volume of concurrent client requests. Performing global all-to-all prediction, auction, and scheduling, especially during the performance estimation and VCG matching phases, would incur prohibitive latency, communication overhead, and computational cost. Second, when agent heterogeneity is substantial, VCG-based mechanisms may violate individual-rationality (IR) constraints (Liu et al.
, 2025), as implied by the Green–Laffont impossibility theorem, which precludes the simultaneous satisfaction of incentive compatibility, individual rationality, budget balance, and allocative efficiency in general settings (Green & Laffont, 1977). We demonstrate this effect in Appendix B.1.

To address these limitations, prior work has shown that clustering-based decomposition is an effective supplement to incentive mechanisms: it bounds problem size, enables parallel intra-cluster computation, and mitigates IR conflicts by reducing heterogeneity within each market (Liu et al., 2025). Building on this insight, IEMAS adopts a proxy-hub architecture to support scalable and incentive-aware scheduling in heterogeneous LLM ecosystems. Specifically, agent service pods are clustered a priori into multiple proxy hubs according to relatively static capability signals, such as model scale, domain specialization, and benchmarked performance (e.g., OpenCompass evaluations (Contributors, 2023)). Incoming client requests are first routed to an appropriate hub using a lightweight, coarse-grained classifier based on task domain and quality-of-service requirements. Fine-grained IEMAS routing, prediction, and VCG-based matching are then executed locally within the selected hub. This two-stage routing and allocation process substantially reduces the dimensionality of the matching problem while preserving the economic-efficiency benefits of incentive-aware scheduling.

Each proxy hub acts as a mediation layer providing authentication, admission control, fine-grained accounting, and KV-cache management, thereby decoupling global market coordination from low-level inference execution.
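The coarse first stage of this two-stage process can be sketched as a simple tag lookup; hub names and domain tags below are hypothetical placeholders, and a production classifier would also weigh quality-of-service requirements:

```python
# Hypothetical sketch of stage-1 hub routing in Section 4.4: map a
# request's domain tag to a specialized hub, falling back to a generalist
# hub. Hub names and tag sets are illustrative, not from the paper.
HUB_DOMAINS = {
    "code_hub":      {"coding", "debugging"},
    "reasoning_hub": {"math", "multi_hop_qa"},
}

def route_to_hub(domain_tag: str) -> str:
    """Return the first hub claiming the tag; otherwise the generalist
    hub, where fine-grained IEMAS matching still applies."""
    for hub, domains in HUB_DOMAINS.items():
        if domain_tag in domains:
            return hub
    return "general_hub"
```

Stage 2 (prediction, auction, and MCMF matching) then runs only over the agents inside the selected hub, shrinking the bipartite graph each auction round must consider.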
Instead of continuous real-time communication, proxy hubs periodically publish standardized, privacy-preserving metadata, such as price signals, available capacity, and compact cache-state summaries, which is sufficient for constructing efficient matchings while reducing bandwidth overhead and protecting proprietary model details. Furthermore, since the incentive mechanisms in §4.2 operate on bipartite request–agent graphs, proxy hubs naturally host auction execution, VCG payment computation, and cache-aware scheduling. By maintaining the local KV-prefix ledger of §4.1, proxies enable reuse-aware matching without exposing raw prompts, model parameters, or inference traces. This separation of concerns allows IEMAS to achieve scalability, economic robustness, cache efficiency, and privacy-preserving distributed inference at Internet scale. Implementation details are deferred to Appendix C.

5. Experiments

We evaluate IEMAS on two critical dimensions: (1) System Efficiency, measuring whether cache-aware routing reduces latency and overhead; (2) Economic Robustness, verifying that the mechanism incentivizes truthful reporting and maximizes social welfare.

5.1. Experimental Setup

We profiled the vLLM inference engine (Kwon et al., 2023) on heterogeneous nodes equipped with NVIDIA RTX 4090 and RTX 6000 GPUs. To faithfully simulate the resource constraints and frequent cache evictions typical of open agent networks, we fixed the concurrent query batch buffer size at 12 and restricted the vLLM gpu_memory_utilization parameter to 0.6. Our agent population is instantiated with a diverse set of models, including LLaMA-3-7B (Dubey et al., 2024), Qwen-4B, and Qwen-8B (Yang et al., 2025a). We evaluate system performance across three distinct interaction modalities: CoQA (Reddy et al.
, 2019) for multi-turn dialogue preservation, QuAC (Choi et al., 2018) for long-context handling, and HotpotQA (Yang et al., 2018) for complex reasoning tasks.

We compare IEMAS against five routing strategies:
• GraphRouter (Feng et al., 2025): a graph-based LLM router that models tasks, queries, and models as a heterogeneous graph for effect/cost estimation.
• GMTRouter (Xie et al., 2025): a personalized router using a heterogeneous graph to capture user–query–model interaction preferences.
• MFRouter (Ong et al., 2025): a matrix-factorization-based router that treats routing as a recommendation problem.
• RouterDC (Chen et al., 2024): a dual-contrastive-learning router that fine-tunes a pre-trained language model to align query representations.
• Random: routes queries uniformly at random as a non-learned baseline.

5.2. System Efficiency

We evaluate resource-reuse efficiency via KV-cache hit rate, latency, and service cost, as detailed in Table 1.

KV Cache Reuse. On multi-turn datasets where context retention is critical, IEMAS demonstrates superior affinity scheduling. Specifically, on CoQA, IEMAS achieves a dominant cache hit rate of 80.2%, significantly outperforming the strongest baseline (GMTRouter at 53.1%). Even on long-context tasks like QuAC, IEMAS maintains a 63.6% hit rate compared to the baseline average of ~30–40%. This confirms that our affinity-scoring mechanism successfully preserves context locality even in complex dialogue scenarios.

Latency Reduction. The impact of cache reuse on TTFT is evident in the Lat column. On the HotpotQA reasoning benchmark, IEMAS reduces median latency to 284.2 ms, avoiding the severe congestion observed in RouterDC (2139.8 ms) and outperforming GMTRouter (372.0 ms).
Similarly, on QuAC, IEMAS achieves the lowest latency of 162.1 ms, a 1.9× speedup over MFRouter (306.2 ms), by consistently routing follow-up queries to agents holding the active KV-prefix and bypassing the prefill phase.

Cost Efficiency. Maximizing cache hit rates translates directly into economic savings. On CoQA, IEMAS yields an average cost of 6.944, a roughly 35% reduction compared to the next most efficient method (MFRouter at 10.507) and a massive reduction compared to Random routing (13.65). While HotpotQA exhibits lower overall cache reuse (0.089) due to the nature of the task, IEMAS still achieves the lowest cost (28.694) among all methods, showing that the VCG-based mechanism effectively optimizes for cost even when cache opportunities are scarce.

5.3. Predictive Accuracy

The efficiency of the auction mechanism hinges on the Proxy Hub's ability to accurately calibrate agent Latency (L), Cost (C), and Accuracy (P). As illustrated in Figure 3, our online Hoeffding Tree predictor demonstrates robust predictive capability across multi-turn interactions. The figure compares observed versus predicted values over 20 turns, revealing tight alignment between the model's estimates and ground truth. Specifically, the model achieves a low Normalized Mean Absolute Error (NMAE) of 0.101 for latency and 0.090 for cost, effectively capturing the non-linear increase in resource consumption as context length grows. Furthermore, accuracy prediction remains stable with an NMAE of 0.069, acting as a reliable expected-value filter against the stochastic noise of observed performance and ensuring the auction mechanism operates on calibrated risk bounds.

5.4. Economic Analysis: Truthfulness and Welfare

A core contribution of IEMAS is Incentive Compatibility (IC). We validate this by introducing strategic agents who attempt to game the system.

Truthfulness Validation.
We simulate four agent bidding strategies: Honest, Aggressive, Conservative, and Random, which report the true value, consistently high values, consistently low values, and randomly perturbed values, respectively. Figure 5 illustrates the cumulative utility (profit) over 100 auction rounds. Under the VCG-based payment rule, truthful agents see steady profit growth. In contrast, the other strategic agents incur penalties due to performance

Method        | CoQA (Multi-turn)       | QuAC (Long Context)     | HotpotQA (Reasoning)
              | KV(%)  Cost    Lat(ms)  | KV(%)  Cost    Lat(ms)  | KV(%)  Cost    Lat(ms)
GraphRouter   | 0.364  10.655  423.2    | 0.310  17.404  443.1    | 0.042  42.243  1470.3
GMTRouter     | 0.531  11.946  342.2    | 0.319  25.302  472.6    | 0.048  44.365  372.0
MFRouter      | 0.384  10.507  414.5    | 0.488  15.372  306.2    | 0.054  39.101  389.2
RouterDC      | 0.482  9.527   357.5    | 0.242  19.443  404.9    | 0.046  34.180  2139.8
Random        | 0.165  13.65   452.6    | 0.343  24.927  189.1    | 0.044  43.683  404.4
IEMAS (Ours)  | 0.802  6.944   354.5    | 0.636  15.868  162.1    | 0.089  28.694  284.2

Table 1. Average system-efficiency comparison across three benchmarks. KV (%) denotes cache hit rate (↑) and Lat is the TTFT (↓).

[Figure 3. The predictive model for QoS factors on the CoQA dataset: observed vs. predicted latency (ms), cost (USD/M tokens), and accuracy over 20 turns with p25–p75 bands; per-turn NMAE is 0.098 (latency), 0.096 (cost), and 0.040 (accuracy).]

[Figure 4. Social welfare comparison: cumulative welfare (×10^4) over 150 dialogue turns for IEMAS, GMTRouter, RouterDC, MFRouter, GraphRouter, and Random.]

[Figure 5. Utility under different auction strategies in the VCG auction.]
deviations, resulting in strictly lower profit in every round. This empirically demonstrates that truth-telling is a dominant strategy in IEMAS.

[Figure 6. Social welfare and computation cost at different clustering levels in routing: VCG calculation cost (ms, log scale) and normalized social welfare versus the number of clusters (1, 2, 5, 10, 20).]

Clustering Trade-off Analysis. To quantify the balance between computational scalability and allocative efficiency, we conducted a sensitivity analysis by varying the number of proxy-hub clusters K. We maintained a constant global environment of M = 100 agents and N = 200 concurrent tasks. As illustrated in Figure 6, increasing K leads to a sharp reduction in solver latency, as the complexity of the Min-Cost Max-Flow (MCMF) algorithm scales super-linearly with problem size. Crucially, this performance gain incurs only a marginal degradation in global social welfare. The results confirm that domain-based clustering effectively partitions the market, significantly reducing scheduling overhead while maintaining near-optimal allocation efficiency. We further analyze the effect of different clustering schemes in Appendix B.1.

Social Welfare. We define social welfare as the cumulative sum of client utility minus agent costs over the session duration. Figure 4 illustrates the welfare accumulation across 160 dialogue turns. IEMAS consistently maintains the steepest growth trajectory, demonstrating superior long-term efficiency. While state-of-the-art baselines like GMTRouter and RouterDC track closely, IEMAS maintains a sustained lead, exceeding all baselines by the final turn.
This gap highlights the advantage of our incentive-compatible mechanism: by explicitly pricing cache affinity in the VCG valuation, IEMAS minimizes the context-recomputation cost that degrades the net utility of other routing strategies over long horizons. In contrast, the Random baseline fails to generate meaningful welfare, validating the necessity of intelligent coordination.

6. Conclusion

The transition to an open Internet of Agents requires reconciling the strategic autonomy of decentralized providers with the physical realities of efficient LLM inference. In this work, we introduced IEMAS, a routing framework that bridges this gap by treating the KV-cache not merely as a system buffer, but as a priceable economic asset. By co-designing a resource-aware predictive model with a truthful VCG mechanism, IEMAS resolves the fundamental tension between maximizing global social welfare and ensuring individual rationality. Our theoretical analysis confirms that the mechanism guarantees truthful reporting and weak budget balance. Empirically, extensive simulations demonstrate that this alignment yields tangible performance gains: IEMAS achieves an 80% KV-cache hit rate, reduces average service cost by 35%, and lowers end-to-end latency by up to 2.9× compared to state-of-the-art baselines. These results suggest that rigorous mechanism design, tightly coupled with system-level observables, is a prerequisite for scaling the future Agentic Web.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Ahuja, R. K., Magnanti, T. L., and Orlin, J. B. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.

Aminiranjbar, Z., Tang, J., Wang, Q., Pant, S., and Viswanathan, M.
Dawn: Designing distributed agents in a worldwide network. IEEE Access, 2025.

Bansal, G., Hua, W., Huang, Z., Fourney, A., Swearngin, A., Epperson, W., Payne, T., Hofman, J. M., Lucier, B., Singh, C., et al. Magentic marketplace: An open-source environment for studying agentic markets. arXiv preprint arXiv:2510.25779, 2025.

Bergemann, D., Bonatti, A., and Smolin, A. The economics of large language models: Token allocation, fine-tuning, and optimal pricing. In Proceedings of the 26th ACM Conference on Economics and Computation, pp. 786–786, 2025.

Chen, J., Shi, J., Chen, Q., and Guo, M. Kairos: Low-latency multi-agent serving with shared LLMs and excessive loads in the public cloud. arXiv preprint arXiv:2508.06948, 2025a.

Chen, S., Jiang, W., Lin, B., Kwok, J., and Zhang, Y. RouterDC: Query-based router by dual contrastive learning for assembling large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

Chen, W., You, Z., Li, R., Qian, C., Zhao, C., Yang, C., Xie, R., Liu, Z., Sun, M., et al. Internet of agents: Weaving a web of heterogeneous agents for collaborative intelligence. In The Thirteenth International Conference on Learning Representations.

Chen, W., You, Z., Li, R., Qian, C., Zhao, C., Yang, R., et al. Internet of agents: Weaving a web of heterogeneous agents for collaborative intelligence. In International Conference on Learning Representations (ICLR), 2025b.

Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., and Zettlemoyer, L. QuAC: Question answering in context. arXiv preprint arXiv:1808.07036, 2018.

Clarke, E. H. Multipart pricing of public goods. Public Choice, 11(1):17–33, 1971.

Contributors, O. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv e-prints, pp. arXiv–2407, 2024.

Feng, T., Shen, Y., and You, J. GraphRouter: A graph-based router for LLM selections. In ICLR, 2025.

Gim, I., Chen, G., Lee, S.-s., Sitaraman, R., Wang, J., and Wei, G.-Y. Prompt cache: Modular attention reuse for low-latency inference. arXiv preprint arXiv:2311.04934, 2024.

Green, J. and Laffont, J.-J. Characterization of satisfactory mechanisms for the revelation of preferences for public goods. Econometrica: Journal of the Econometric Society, pp. 427–438, 1977.

Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., Wiest, O., and Zhang, X. Large language model based multi-agents: A survey of progress and challenges. In IJCAI, 2024.

Guo, Z., Zhou, Y., Wang, C., You, L., Bian, M., and Zhang, W. Beta web: Towards a blockchain-enabled trustworthy agentic web. arXiv preprint, 2025.

Hershberger, J. and Suri, S. Vickrey prices and shortest paths: What is an edge worth? In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pp. 252–259. IEEE, 2001.

Hoffman, A. J. and Kruskal, J. B. Integral boundary points of convex polyhedra. In Linear Inequalities and Related Systems, volume 38 of Annals of Mathematics Studies, pp. 223–246. Princeton University Press, 1956.

Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S. K. S., Lin, Z., Zhou, L., Ran, C., Xiao, L., Wu, C., and Schmidhuber, J. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VtmBAGCN7o.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I.
Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23), 2023. doi: 10.1145/3600006.3613165.

Leonard, H. B. Elicitation of honest preferences for the assignment of individuals to positions. Journal of Political Economy, 91(3):461–479, 1983.

Li, G., Hammoud, H. A. A. K., Itani, H., Khizanishvili, D., and Ghanem, B. CAMEL: Communicative agents for "mind" exploration of large language model society. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

Liu, H., Wang, M., Liu, J., Yuan, S., Lou, J., Wu, C., and Li, J. CAST: Cluster-driven truthful crowdfunding mechanism for shared AI service deployment. IEEE Transactions on Services Computing, 2025.

Microsoft. DeepSpeed-MII: High-performance inference for large models, 2023. URL https://github.com/microsoft/DeepSpeed-MII.

Ong, I., Amouyal, K., Levine, Y., et al. RouteLLM: Learning to route LLMs with preference data. In International Conference on Learning Representations (ICLR), 2025.

Pan, Z., Patel, A., Shen, Y., Hu, Z., Guan, Y., Li, W.-L., Qin, L., Wang, Y., and Ding, Y. KVFlow: Efficient prefix caching for accelerating LLM-based multi-agent workflows. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

Patel, P., Choukse, E., Zhang, C., Shah, A., Neiswanger, W., and Wang, Y. Splitwise: Efficient generative LLM inference using phase splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA), 2024.

Qiu, H., Zhu, K., Luong, N. C., Yi, C., Niyato, D., and Kim, D. I. Applications of auction and mechanism design in edge computing: A survey. IEEE Transactions on Cognitive Communications and Networking, 8(2):1034–1058, 2022.

Reddy, S., Chen, D., and Manning, C. D.
CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019.

Schrijver, A. Combinatorial Optimization: Polyhedra and Efficiency, volume 24. Springer, Berlin, Heidelberg, 2003.

Sheng, Y., Cao, S., Li, D., Hooper, C., Lee, N., Yang, S., Chou, C., Bang, B., Treviso, D., Gonzalez, J. E., et al. S-LoRA: Serving thousands of concurrent LoRA adapters. arXiv preprint arXiv:2311.03285, 2023.

Sun, B., Huang, Z., Zhao, H., Xiao, W., Zhang, X., Li, Y., and Lin, W. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 173–191, 2024.

Vickrey, W. Counterspeculation, auctions, and competitive sealed tenders. The Journal of Finance, 16(1):8–37, 1961.

Wang, Y., Su, Z., Pan, Y., Luan, T. H., Li, R., and Yu, S. Social-aware clustered federated learning with customized privacy preservation. IEEE/ACM Transactions on Networking, 2024.

Wang, Y., Guo, S., Pan, Y., Su, Z., Chen, F., Luan, T. H., Li, P., Kang, J., and Niyato, D. Internet of agents: Fundamentals, applications, and challenges. IEEE Transactions on Cognitive Communications and Networking, 2025. doi: 10.1109/TCCN.2025.3623369. URL https://ieeexplore.ieee.org/document/10834360.

Wu, T. et al. AutoGen: Enabling next-gen LLM applications via multi-agent framework. arXiv preprint arXiv:2308.08155, 2023.

Xie, E., Sun, Y., Feng, T., and You, J. GMTRouter: Personalized LLM router over multi-turn user interactions. arXiv preprint arXiv:2511.08590, 2025.

Xie, T., Zhou, F., Cheng, Z., Shi, P., Weng, L., Liu, Y., Hua, T. J., Zhao, J., Liu, Q., et al. OpenAgents: An open platform for language agents in the wild. arXiv preprint arXiv:2310.10634, 2023.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.
arXiv preprint arXiv:2505.09388, 2025a.

Yang, Y., Chai, H., Shao, S., Song, Y., Qi, S., Rui, R., and Zhang, W. AgentNet: Decentralized evolutionary coordination for LLM-based multi-agent systems. arXiv preprint arXiv:2504.00587, 2025b.

Yang, Y., Ma, M., Huang, Y., Chai, H., Gong, C., Geng, H., Zhou, Y., Wen, Y., Fang, M., Chen, M., et al. Agentic web: Weaving the next web with AI agents. arXiv preprint arXiv:2507.21206, 2025c.

Yang, Y., Wen, Y., Wang, J., and Zhang, W. Agent exchange: Shaping the future of AI agent economics. arXiv preprint arXiv:2507.03904, 2025d.

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, 2018.

Yao, J., Li, H., Liu, Y., Ray, S., Cheng, Y., Zhang, Q., Du, K., Lu, S., and Jiang, J. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion. In Proceedings of the Twentieth European Conference on Computer Systems, pp. 94–109, 2025.

Ye, D., Cai, S., Du, H., Kang, J., Liu, Y., Yu, R., and Niyato, D. Optimizing AIGC services by prompt engineering and edge computing: A generative diffusion model-based contract theory approach. IEEE Transactions on Vehicular Technology, 2024.

You, F., Yuan, X., Ni, W., and Jamalipour, A. Privacy-preserving multi-agent deep reinforcement learning for effective resource auction in multi-access edge computing. IEEE Transactions on Cognitive Communications and Networking, 2024.

Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., and Chun, B.-G. Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 521–538, 2022.
Zheng, L., Li, L., Zhang, H., Zhuang, S., Chen, Z., Huang, Y., Gonzalez, J. E., and Stoica, I. SGLang: Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104, 2023.

Zhong, Y., Tong, Y., Kang, J., Dai, M., Dai, H.-N., Su, Z., and Niyato, D. Hybrid Stackelberg game and diffusion-based auction for two-tier agentic AI task offloading in Internet of Agents. arXiv preprint arXiv:2511.22076, 2025.

A. Proof

A.1. Efficiency

The MCMF algorithm minimizes the total cost function on the network:

$$\min_{f} \sum_{(u,v) \in E} \mathrm{cost}_{uv} \cdot f_{uv} \iff \max_{x} \sum_{i \in \mathcal{S}} \sum_{j \in \mathcal{C}} w_{ij} x_{ij} \qquad (10)$$

The constraint matrix A of this bipartite matching problem corresponds to the node–edge incidence matrix. The rows represent flow-conservation constraints at task nodes j (capacity 1) and agent nodes i (capacity B_i). By the Hoffman–Kruskal theorem (Hoffman & Kruskal, 1956), the incidence matrix of a bipartite graph is totally unimodular (TU). A fundamental property of TU matrices is that for any integral capacity vector b (here, 1 and B_i are integers), the polyhedron of feasible solutions P = {x | Ax ≤ b, x ≥ 0} has integral vertices (Schrijver, 2003). Consequently, the solution found by the MCMF algorithm, which solves the linear relaxation of the problem, is guaranteed to be integral (x*_ij ∈ {0, 1}). Since MCMF provides an exact solution to the minimum-cost circulation problem, the resulting allocation x* is the global maximizer of social welfare.

A.2. Truthfulness

Let u_j(\hat{v}_j, v_{-j}) denote the utility of client j when reporting valuation \hat{v}_j while the other clients report v_{-j}. The utility is defined as the realized value minus the payment:

$$u_j = v_j - p_j \qquad (11)$$

Substitute the VCG payment rule from Eq. (8) into the utility function.
Note that W(C) depends on the reported \hat{v}_j, so we expand W(C) = (\hat{v}_j - c_{i,j}) + \sum_{k \neq j} w^*_{k,s_k}, where the sum represents the optimal welfare of all other matched pairs in the presence of j.

$$\begin{aligned}
u_j &= v_j - \left[ W(C \setminus \{j\}) - \big( W(C) - (\hat{v}_j - c_{i,j}) \big) + c_{i,j} \right] \\
    &= v_j - W(C \setminus \{j\}) + W(C) - \hat{v}_j + c_{i,j} - c_{i,j} \\
    &= (v_j - \hat{v}_j) + W(C) - W(C \setminus \{j\})
\end{aligned} \qquad (12)$$

If the client reports truthfully, so that \hat{v}_j = v_j, the term (v_j - \hat{v}_j) vanishes:

$$u_j = \underbrace{W(C)}_{\text{total social welfare}} - \underbrace{W(C \setminus \{j\})}_{\text{constant independent of } j} \qquad (13)$$

The term W(C \ {j}) represents the system welfare if client j were absent, which is independent of client j's strategy. Thus, maximizing individual utility u_j is mathematically equivalent to maximizing the total social welfare W(C). Since the MCMF algorithm finds the exact allocation that maximizes W(C) based on reported values, the client maximizes their own utility u_j if and only if they report the true valuation v_j that allows the mechanism to optimize the true social welfare. Under-reporting or over-reporting v_j may lead to a suboptimal allocation (or no allocation) that yields a strictly lower utility.

A.3. Budget Balance

Consider a single matched transaction between client j and agent i. The platform collects p_j from the client and compensates the agent c_{i,j}. The platform's net surplus for this transaction is:

$$\Delta_j = p_j - c_{i,j}$$

Using Eq. (8):

$$\Delta_j = \left[ W(C \setminus \{j\}) - (W(C) - w_{i,j}) + c_{i,j} \right] - c_{i,j} = W(C \setminus \{j\}) - \big( W(C) - w_{i,j} \big) \qquad (14)$$

Here, W(C \ {j}) represents the maximum social welfare achievable by the remaining participants without client j. The term (W(C) - w_{i,j}) represents the actual welfare of the remaining participants in the optimal allocation with client j present.
Since agent capacity is finite (resource contention), the presence of client j can only decrease or leave unchanged the aggregate welfare available to others, by consuming a slot that might have gone to another high-value task; it cannot increase the welfare of others. Therefore:

$$W(C \setminus \{j\}) \;\geq\; W(C) - w_{i,j} \;\implies\; \Delta_j \geq 0. \qquad (15)$$

Since the surplus \Delta_j is non-negative for every matched request, the sum over all requests is non-negative.

Remark: The non-negative surplus \sum_j \Delta_j represents the "information rent" or externality tax collected by the protocol. In the IEMAS implementation, this surplus can be retained by the Hub as a maintenance fee or redistributed to agents as a long-term participation incentive (e.g., staking rewards) without violating the budget constraint.

B. Extended Analysis

B.1. Cluster Strategy Effect Analysis

As discussed in Section 4.2, IEMAS employs the VCG mechanism to ensure incentive compatibility. While theoretically robust, the computational cost of calculating Clarke pivot payments is substantial. Specifically, calculating the externality for each request requires re-solving the Min-Cost Max-Flow (MCMF) problem N times for N concurrent requests. In a global, flat market with thousands of agents, this O(N · T_MCMF) complexity becomes prohibitive for real-time inference routing. To address this, Section 4.4 introduces a hierarchical Proxy Hub architecture. While clustering agents reduces the search space, potentially lowering the theoretical global maximum social welfare (SW), our analysis shows that the specific characteristics of LLM inference mitigate this loss.

Market Fragmentation vs. Cache Locality. In generic resource allocation, partitioning agents into disjoint clusters restricts the solution space, creating a "fragmentation penalty" in which a task cannot access an idle agent in a different cluster.
However, LLM inference efficiency is dominated by KV-cache locality. Global routing often suffers from context thrashing, where semantically similar queries are dispersed. By clustering agents based on domain specialization (as verified in Appendix A.1), the system maximizes intra-cluster cache reuse (o_ij). This gain in execution efficiency (lower cost C_i and latency L_j) effectively offsets the loss in matching optionality.

[Figure 7. Economic performance under different cluster schemes. Full-Mix is a heterogeneous baseline with no prior alignment between tasks and agents. Ideal represents homogeneous alignment, where tasks and agents are pre-clustered to minimize matching entropy. Task-Mix clusters agents by specialization while tasks remain heterogeneous, whereas Agent-Mix clusters tasks while agents remain heterogeneous, assessing how each side absorbs structural mismatch.]

Figure 7 shows that a well-aligned clustering scheme realizes the Ideal situation, maintaining high social welfare and achieving results comparable to Full-Mix. However, one-sided clustering schemes such as Task-Mix or Agent-Mix cause task congestion and cutthroat competition, since high-end agent service resources are relatively scarce; this reduces social welfare, increases the number of clients with negative utility, and undermines individual rationality (IR).

C. System Integration and Implementation

Figure 8. The System Structure of IEMAS

This appendix details the software architecture and implementation logic of IEMAS.
The system is designed as an asynchronous, event-driven proxy service that acts as an intelligent router between a high-throughput client load generator and a heterogeneous pool of Large Language Model (LLM) backends. The implementation relies primarily on Python's asyncio for concurrency and FastAPI for the web-service layer. The architecture is divided into three primary functional blocks: the Client (load generation), the Router (state management, prediction, and allocation), and the Backend Interface (protocol translation and telemetry).

C.1. Client Implementation

The client module (iemas.client) implements a pure load generator designed to stress-test the routing infrastructure using the CoQA dataset.

State Management and Concurrency. The client executes multiple dialogues concurrently using asyncio.gather. Within each dialogue task (run_dialogue), turns are executed sequentially to preserve conversational causality: turn N must complete before turn N+1 begins. The client maintains a local conversation-history list, appending user questions and assistant answers verbatim as they occur. Concurrency is controlled via a global asyncio.Semaphore (max_concurrency), ensuring a fixed upper bound on the number of simultaneously active requests.

Protocol and Tracing. Unlike standard benchmarks that might maintain a persistent connection, the client initiates a fresh HTTP POST request for every turn to the router's /v1/chat/completions endpoint. To enable the router to track sessions across these stateless HTTP requests, the client injects custom tracing headers:
• X-IEMAS-RUN-ID: a unique identifier for the experiment run.
• X-IEMAS-DIALOGUE-ID: the unique ID of the current conversation.
• X-IEMAS-TURN-NUMBER: the monotonic turn index (1-based).
• X-IEMAS-SOURCE: the provenance of the dialogue text.

C.2.
Router Architecture

The router (iemas.router) acts as the central decision-making entity. It does not perform inference itself but routes requests to backend endpoints defined in a YAML configuration file.

C.2.1. Asynchronous Micro-Batching

To enable collective decision-making (such as auctions) rather than greedy per-request routing, the router implements a MicroBatcher component. Incoming requests are not processed immediately; they are wrapped in a PendingChatCompletion object containing the request body and an unresolved asyncio.Future. These pending objects are submitted to the MicroBatcher, which buffers them in an asyncio.Queue. The batcher yields a batch for processing when one of two conditions is met:
1. Size threshold: the queue size reaches max_batch_size (e.g., 16 requests).
2. Time threshold: the oldest request in the queue has waited for max_wait_ms (e.g., 10 ms).
This mechanism ensures that the overhead of batching remains bounded by a tight latency budget. The batcher runs as a background task (run_loop) and invokes a callback handler (handle_chat_batch) for every emitted batch.

C.2.2. Prefix Caching and State Tracking

Effective routing requires estimating the state of the Key-Value (KV) cache at each backend without direct access to backend memory. The router maintains a local TextPrefixCache.

Data Structure. The cache is a dictionary mapping the tuple (backend_id, model, dialogue_id) to the full text of the prompt used in the previous turn.

Feature Extraction. For every request in a micro-batch, the router computes the Longest Common Prefix (LCP) between the current request's serialized prompt and the cached text for each candidate backend. This computation yields a ratio (LCP length / prompt length) and lcp_chars, which serve as proxy features for the backend's actual KV-cache hit rate.

Eviction Heuristic.
The router implements a heuristic eviction policy (should_evict_router_prefix_cache). If a backend reports zero cached tokens in its usage statistics despite a high router-side prefix match, the router infers that the backend has evicted the KV cache and invalidates its own local record to resynchronize state.

C.2.3. Online Prediction Subsystem

The system employs an AsyncBackendPredictorPool to manage online learning models. To prevent interference between the learning processes of different backends, the system maintains a separate, independent AgentPredictor instance for each backend.

Feature Engineering. The PredictorInput dataclass aggregates features at routing time. These include:
• Request features: prompt length (characters) and turn number.
• Cache features: the computed kvmatch_text_ratio.
• System load: router-wide inflight requests and global requests per second (RPS).
• Local load: backend-specific metrics including backend_inflight, backend_rps, and utilization (inflight divided by capacity).

Model Updates. The predictors utilize Hoeffding Trees (HoeffdingTreeRegressor, HoeffdingTreeClassifier) from the river library. They support partial fitting via the learn_one method, allowing the system to update each model incrementally immediately after a request completes.

C.2.4. Auction-Based Routing Logic

When the routing policy is set to auction, the router executes the select_backends_auction function.

Graph Construction. The allocation problem is modeled as a Min-Cost Max-Flow (MCMF) network-flow graph:
• A Source node connects to all Request nodes with capacity 1 and cost 0.
• Request nodes connect to Backend nodes. An edge exists only if the calculated welfare for that assignment is positive. The capacity is 1, and the cost is set to the negative welfare (since standard solvers minimize cost).
• Backend nodes connect to a Sink node with capacity equal to the backend's available concurrency slots.

Solver. The router includes a custom, dependency-free implementation of the Successive Shortest Path algorithm (iemas.router.auction.mcmf). It uses Bellman-Ford potentials to handle negative edge costs and Dijkstra's algorithm to find augmenting paths.

VCG Payment Calculation. To ensure economic robustness, the system computes Vickrey-Clarke-Groves (VCG) payments. This requires calculating the "externality" each request imposes on the others. For a batch of size N, the router runs the MCMF solver N + 1 times: once to find the optimal allocation, and once for each request i (with i removed from the graph) to compute the counterfactual welfare.

C.2.5. Performance Evaluation

To provide a ground-truth signal for the performance predictor, the router evaluates response correctness asynchronously through the PerformanceEvaluator protocol. Two implementations are provided:

• RougeCoqaEvaluator: computes ROUGE-1/2/L F1 scores against the dataset's gold answer.
• TokenSpanCoqaEvaluator: a deterministic evaluator that normalizes the text (lowercasing, number normalization) and checks whether the gold-answer tokens appear as a contiguous subsequence of the model output.

C.3. Backend Interface and Deployment

The IEMAS router abstracts the underlying inference engines through a unified protocol layer defined in iemas.router.components.backend. This layer manages connection pooling, protocol translation, and high-fidelity telemetry injection. While the system is agnostic to the specific serving framework, our reference implementation deploys high-throughput vLLM instances.

C.3.1. The HttpOpenAIBackend Abstraction

The core component that interacts with model servers is the HttpOpenAIBackend class.
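As a concrete illustration, two responsibilities of such a backend wrapper can be sketched with stdlib-only helpers: the forced-streaming payload rewrite (detailed in Section C.3.2) and the normalization of the three pricing tiers into a unified cost proxy. The helper names and signatures here are hypothetical, not the actual implementation:

```python
import copy

def prepare_streaming_payload(body: dict) -> dict:
    """Hypothetical sketch: rewrite a chat-completions request body so the
    forwarded request always streams and always reports token usage,
    regardless of the client's original preference."""
    payload = copy.deepcopy(body)
    payload["stream"] = True
    # Ask the server to attach usage statistics to the final stream chunk.
    opts = dict(payload.get("stream_options") or {})
    opts["include_usage"] = True
    payload["stream_options"] = opts
    return payload

def cost_proxy(prompt_tokens: int, cached_tokens: int, completion_tokens: int,
               price_in: float, price_cached: float, price_out: float) -> float:
    """Hypothetical sketch: fold the input / cached-input / output pricing
    tiers into a single scalar cost estimate."""
    fresh_tokens = prompt_tokens - cached_tokens  # tokens actually prefilled
    return (fresh_tokens * price_in
            + cached_tokens * price_cached
            + completion_tokens * price_out)
```

Keeping these as pure functions makes the protocol translation testable independently of any live backend connection.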
The HttpOpenAIBackend class wraps an asynchronous httpx.AsyncClient to manage persistent HTTP/1.1 connections to the backend's /v1/chat/completions endpoint.

Authentication and Configuration. Each backend instance is initialized with a base_url_v1 and an optional api_key. The backend automatically injects the Authorization: Bearer header into every outbound request. The configuration also supports distinct pricing models for input, cached-input, and output tokens, which are normalized into a unified cost proxy during initialization.

C.3.2. Telemetry Injection via Forced Streaming

A critical requirement for the router is to measure Time-To-First-Token (TTFT) latency to train its prediction models, even when the client requests a standard non-streaming JSON response. To achieve this without modifying the standard OpenAI protocol, we implement a side-channel measurement technique within forward_chat_completions.

Protocol Translation Logic. When the router processes a request, it modifies the payload before forwarding it to the backend:

1. Forced Streaming: the router explicitly sets stream=True in the request body, regardless of the client's original preference.
2. Stream Options: it injects stream_options={"include_usage": True} to ensure that the backend sends token usage statistics (prompt and completion token counts) at the end of the stream.

Stream Consumption and Measurement. The backend consumes the resulting Server-Sent Events (SSE) stream using an asynchronous iterator:

• TTFT Measurement: the router records a monotonic timestamp (T_start) immediately before sending the request, and a second timestamp (T_first) upon receiving the first valid SSE data chunk containing a content delta.
• Response Reconstruction: because the client may expect a single JSON object, the router employs a StreamReconstruction class. This component aggregates the incoming choices[].delta.content fragments into a buffered ChoiceAccum object.
• Final Assembly: once the stream terminates (on receipt of [DONE]), the reconstructor builds a standard ChatCompletion JSON object, populating the usage field from the final stream chunk.

Internal Telemetry Transport. To pass the measured TTFT up to the router's logging and prediction layers without breaking the API contract, the system injects the measurement into the reconstructed JSON payload under a private key, iemas_t_first_token_monotonic. This key is extracted and removed by the routing pipeline before the final response is returned to the client.

C.3.3. Reference Deployment Configuration

The system is validated against vLLM backends. The deployment scripts use specific flags to enable the advanced memory features relied upon by the router's prefix cache and cost models.

vLLM Configuration Flags. As seen in the deployment scripts (e.g., llama.sh, qwen.sh), the backends are launched with the following critical arguments:

• --enable-prefix-caching: enables the backend's internal BlockManager to reuse KV-cache pages across requests sharing common prefixes. This is the physical mechanism that the router's TextPrefixCache attempts to model.
• --enable-prompt-tokens-details: instructs vLLM to return a detailed breakdown of token usage, specifically the cached_tokens field. The router uses this ground-truth signal to validate its prefix-cache predictions and to detect eviction events.
• --max-model-len 4096: ensures a consistent context window across heterogeneous models to prevent out-of-memory errors during high-concurrency micro-batching.

D.
Pseudo-code of IEMAS

Algorithm 1 IEMAS: Incentive-Efficiency Routing Framework
1: Input: batch of client tasks C, set of agents S with capacities {B_i}
2: Global State: predictive model M, agent prefix ledgers L = {p̄_{i,d}}
3: // Phase 1: Cache-Aware Prediction & Valuation
4: for each task j ∈ C and each agent i ∈ S do
5:   Retrieve last cached prompt p̄_{i,d(j)} from L
6:   Compute cache affinity ω_{ij} via Eq. 4
7:   Extract feature vector x_{ij} (load, cache, metadata)
8:   Predict QoS: (L̂_{ij}, P̂_{ij}, Ĉ_{ij}) ← M.predict(x_{ij})
9:   Calculate client valuation v_j via Eq. 1
10:  Compute net welfare weight w_{ij} ← v_j − Ĉ_{ij}
11:  if w_{ij} < 0 then
12:    Prune edge (i, j)  {exclude inefficient matches}
13:  end if
14: end for
15: // Phase 2: Welfare Maximization (MCMF)
16: Construct bipartite graph G with edge costs −w_{ij}
17: Compute allocation x* ← MinCostMaxFlow(G)  {Theorem 4.1}
18: Calculate total welfare W(C)
19: // Phase 3: VCG Payment & Dispatch
20: for each task j matched to agent i in x* do
21:   // Calculate opportunity cost (externality)
22:   Re-solve MCMF for G \ {j} to get W(C \ {j})
23:   Compute payment p_j via Eq. 8
24:   Dispatch task j to agent i
25: end for
26: // Phase 4: Execution & Online Learning
27: for each completed task j by agent i do
28:   Observe realized metrics L^obs_{ij}, C^obs_{ij}, P^obs_{ij}
29:   M.update(x_{ij}, labels)  {Hoeffding Tree regressor update}
30:   L.update(i, d(j), new prompt)  {update prefix ledger}
31: end for
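Phases 2-3 of Algorithm 1 can be sketched in a few lines of stdlib Python. This is a deliberately simplified illustration, not the actual implementation: it assumes unit backend capacities and replaces the successive-shortest-path MCMF solver with brute-force enumeration (tractable only for tiny batches), and the payment rule is the standard Clarke pivot corresponding to the externality computation of Eq. 8. All names are illustrative.

```python
from itertools import permutations

def best_welfare(w, tasks, agents):
    """Return (max total welfare, assignment), where w[j] maps agent -> net
    welfare weight w_ij = v_j - C_hat_ij for task j; missing entries are
    pruned edges. Each agent serves at most one task (unit capacity)."""
    # Pad the agent pool with None so a task may remain unassigned.
    slots = list(agents) + [None] * len(tasks)
    best_total, best_assign = 0.0, {}
    for perm in permutations(slots, len(tasks)):
        total, assign = 0.0, {}
        for j, i in zip(tasks, perm):
            wij = None if i is None else w.get(j, {}).get(i)
            if wij is None or wij < 0:
                continue  # pruned edge or unassigned: contributes nothing
            total += wij
            assign[j] = i
        if total > best_total:
            best_total, best_assign = total, assign
    return best_total, best_assign

def vcg_payments(w, tasks, agents):
    """Clarke-pivot payments: each matched task pays the welfare loss its
    presence imposes on the other tasks (its externality)."""
    W_all, assign = best_welfare(w, tasks, agents)
    payments = {}
    for j, i in assign.items():
        # Counterfactual welfare with task j removed from the batch.
        W_without_j, _ = best_welfare(w, [t for t in tasks if t != j], agents)
        others_with_j = W_all - w[j][i]  # welfare others obtain when j is present
        payments[j] = W_without_j - others_with_j
    return assign, payments
```

For instance, with w = {"t1": {"a": 5.0, "b": 2.0}, "t2": {"a": 4.0, "b": 3.0}}, the welfare-maximizing assignment sends t1 to agent a and t2 to agent b; t1 then pays 1.0 (displacing t2 from agent a costs the rest of the batch 4.0 − 3.0), while t2 pays nothing because it displaces no one.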
