CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe throughput degradation well before memory capacity is exhausted. We identify this phenomenon as middle-phase thrashing, a previously under-characterized pathology in which cache efficiency collapses as long-lived agents accumulate state over time. We argue that mitigating this pathology requires moving beyond reactive, request-level cache management to proactive, agent-level admission control. Drawing inspiration from congestion control in distributed systems, we view the KV cache as a shared resource whose efficient utilization depends on feedback-driven regulation. Based on this insight, we present CONCUR, a lightweight control layer that regulates agent admission to bound aggregate cache pressure while preserving execution continuity. CONCUR adapts a cache-aware control algorithm to dynamically adjust the number of active agents using runtime cache signals. Across large models and real-world agent workloads, CONCUR prevents middle-phase thrashing and improves batch inference throughput by up to 4.09x on Qwen3-32B and 1.9x on DeepSeek-V3, while remaining compatible with existing LLM serving systems.


💡 Research Summary

The paper addresses a critical performance problem that emerges when large language models (LLMs) are used as autonomous agents in high-throughput batch inference scenarios. In such workloads each agent repeatedly calls external tools, appends tool outputs and reasoning steps to its prompt, and therefore its context grows monotonically over many generation steps. This growth translates into a continuously expanding GPU-resident key-value (KV) cache footprint. Existing serving stacks, optimized for short, highly overlapping chat-style requests, rely on prefix-caching trees and a Least-Recently-Used (LRU) eviction policy. While these mechanisms keep cache hit rates high for static or lightly changing prompts, they break down for agentic workloads because agents progress asynchronously: some agents are actively generating tokens while others are paused waiting for tool results. The paused agents' cache entries go "cold" and are evicted by LRU as active agents consume more cache slots. When a paused agent resumes, the serving engine must reconstruct its entire prefix, either by recomputing it from scratch or, if the evicted KV blocks were off-loaded to CPU memory, by reloading them over PCIe. This repeated reconstruction incurs latency that grows with the agent's accumulated history.
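The monotonic growth described above can be made concrete with a small sketch. This is illustrative only (the function and the token counts are hypothetical, not from the paper): each tool-calling turn appends both generated reasoning and the tool's output to the context, so the per-agent KV-cache demand only ever grows.

```python
def run_agent_turns(initial_prompt_tokens: int,
                    turns: list[tuple[int, int]]) -> list[int]:
    """Track an agent's context length (a proxy for its KV footprint)
    after each turn. `turns` holds (generated_tokens, tool_output_tokens)
    pairs; both are appended to the context, which never shrinks.
    """
    context = initial_prompt_tokens
    footprint = []
    for generated, tool_output in turns:
        context += generated + tool_output  # reasoning step + appended tool result
        footprint.append(context)
    return footprint

# Example: a 500-token prompt followed by three tool-calling turns.
print(run_agent_turns(500, [(200, 300), (150, 400), (250, 100)]))
# → [1000, 1550, 1900]
```

With many such agents resident at once, these strictly increasing footprints are what drive the sustained, cumulative cache pressure the paper describes.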

The authors call this phenomenon middle-phase thrashing. It is distinct from classic memory-capacity thrashing: throughput collapses not because memory runs out, but because once cache usage saturates, the hit rate drops dramatically, triggering a vicious cycle of eviction and recomputation that dominates execution time (over 90% of total runtime in their traces). Empirical traces from a large-scale deployment of DeepSeek-V3 agents show a three-phase pattern: a warm-up phase with high hit rates, a prolonged middle phase with saturated cache usage but low hit rates, and a cooldown phase as agents finish and memory pressure eases. During the middle phase, adding more agents paradoxically reduces overall throughput because it exacerbates cache contention.

To solve this, the paper proposes CONCUR, a lightweight control layer that shifts the granularity of scheduling from individual generation steps to whole agents. Inspired by congestion control in networking, CONCUR treats the KV cache as a shared bandwidth resource and uses runtime feedback (cache usage percentage and hit-rate degradation) to regulate how many agents are allowed to issue generation requests at any moment. The control algorithm adapts the classic Additive-Increase/Multiplicative-Decrease (AIMD) scheme: when no congestion signal is observed, the allowed number of concurrent agents (the "congestion window") is increased by a small additive factor; when congestion is detected (e.g., cache usage above 85% or a sharp drop in hit rate), the window is reduced multiplicatively (e.g., halved). This feedback loop runs at each generation step, dynamically throttling or expanding the agent pool without modifying the underlying serving engine.
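The AIMD loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the parameter values (85% usage threshold, +1 additive step, 0.5 decrease factor) follow the numbers quoted in the summary, but the class and method names are assumptions.

```python
class AIMDController:
    """Congestion-window controller over the number of active agents."""

    def __init__(self, min_agents=1, max_agents=64,
                 usage_threshold=0.85, additive_step=1, decrease_factor=0.5):
        self.window = min_agents              # agents currently allowed to run
        self.min_agents = min_agents
        self.max_agents = max_agents
        self.usage_threshold = usage_threshold
        self.additive_step = additive_step
        self.decrease_factor = decrease_factor

    def update(self, cache_usage: float, hit_rate_dropped: bool) -> int:
        """Called once per generation step with runtime cache signals;
        returns the new number of agents allowed to issue requests."""
        congested = cache_usage > self.usage_threshold or hit_rate_dropped
        if congested:
            # Multiplicative decrease: back off quickly under cache pressure.
            self.window = max(self.min_agents,
                              int(self.window * self.decrease_factor))
        else:
            # Additive increase: gently probe for spare cache capacity.
            self.window = min(self.max_agents, self.window + self.additive_step)
        return self.window
```

As in TCP, the asymmetry (slow growth, fast backoff) is the point: it keeps aggregate cache pressure bounded while still discovering how many agents the cache can actually sustain.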

CONCUR is designed to be non‑intrusive: it sits between the agent execution system and the LLM serving engine, preserving existing prefix‑tree structures, CPU off‑loading mechanisms, and tool‑calling logic. The authors evaluate CONCUR on two state‑of‑the‑art models—Qwen3‑32B (32 billion parameters) and DeepSeek‑V3—across a range of agent counts (1 to 64) and realistic workloads such as reinforcement‑learning rollouts, data distillation, and large‑scale evaluation. Results show that CONCUR eliminates the middle‑phase thrashing: cache hit rates stay above 70 % throughout execution, and overall throughput improves up to 4.09× for Qwen3‑32B and 1.90× for DeepSeek‑V3 compared with the baseline that uses only LRU eviction. Latency breakdowns reveal that the costly recomputation component, which previously accounted for nearly half of the end‑to‑end latency, is dramatically reduced. Moreover, because CONCUR does not replace existing components, it can be integrated into current serving stacks with minimal engineering effort.
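Where such a layer sits can also be sketched: a gate between the agent executor and the serving engine that admits at most "window" agents, queues the rest with their state intact (preserving execution continuity), and is resized by the congestion controller. This is a hypothetical illustration under those assumptions, not an actual serving-engine API.

```python
from collections import deque

class AdmissionGate:
    """Admit at most `window` agents; queue the rest without losing state."""

    def __init__(self, window: int):
        self.window = window
        self.active = set()       # agent ids currently issuing requests
        self.waiting = deque()    # agent ids queued, FIFO

    def request_admission(self, agent_id: str) -> bool:
        """Admit the agent if the window allows; otherwise queue it."""
        if len(self.active) < self.window:
            self.active.add(agent_id)
            return True
        self.waiting.append(agent_id)
        return False

    def release(self, agent_id: str) -> None:
        """Agent paused for a tool call or finished; admit the next waiter."""
        self.active.discard(agent_id)
        self._refill()

    def resize(self, window: int) -> None:
        """Apply a new window size from the congestion controller."""
        self.window = window
        self._refill()

    def _refill(self) -> None:
        while self.waiting and len(self.active) < self.window:
            self.active.add(self.waiting.popleft())
```

Because the gate only decides *when* an agent may issue its next generation request, the serving engine's prefix trees, off-loading paths, and tool-calling logic remain untouched, which is what makes the approach non-intrusive.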

The paper also discusses limitations and future directions. While CONCUR currently bases its control decisions on GPU cache usage and hit‑rate, other resources such as PCIe bandwidth or CPU memory pressure could become bottlenecks in different hardware configurations. The AIMD parameters (additive step size, multiplicative factor, thresholds) were manually tuned for the evaluated workloads; an automatic, workload‑aware tuning mechanism would further improve robustness. Finally, the authors argue that the findings suggest a broader shift in LLM serving architecture: moving from request‑level, stateless scheduling to agent‑level, stateful scheduling, and from reactive eviction policies to proactive, feedback‑driven admission control.

In summary, the paper makes three key contributions: (1) identification and thorough characterization of middle‑phase thrashing as a dominant performance pathology in agentic batch inference; (2) a novel, congestion‑control‑inspired admission‑control framework (CONCUR) that dynamically regulates agent concurrency based on cache pressure; and (3) extensive empirical validation demonstrating substantial throughput gains and compatibility with existing serving engines. This work paves the way for more scalable, efficient deployment of LLM‑powered agents in real‑world, high‑throughput applications.

