Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present \textbf{BudgetMem}, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., \textsc{Low}/\textsc{Mid}/\textsc{High}). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
💡 Research Summary
Memory management is a critical bottleneck for large‑language‑model (LLM) agents that must operate beyond a single context window. Existing approaches largely rely on offline, query‑agnostic memory construction: past interactions are pre‑processed, compressed, or indexed without regard to the current query. While this “build‑once, use‑always” paradigm simplifies system design, it wastes computation on irrelevant information and can discard details essential for specific queries. The authors therefore propose BudgetMem, a runtime‑oriented memory framework that makes performance‑cost trade‑offs explicit, query‑aware, and controllable.
BudgetMem treats memory extraction as a multi‑stage modular pipeline. In the paper’s concrete instantiation the pipeline consists of a filtering module followed by three parallel extraction modules (entity, temporal, topic) and a final summarization module. Crucially, each module exposes a common three‑tier budget interface—Low, Mid, and High—allowing the system to allocate different amounts of compute to each stage at inference time. The authors explore three orthogonal ways to realize these tiers:
- Implementation tiering – swapping the underlying algorithm (e.g., rule‑based → BERT‑based → full‑scale LLM) to vary implementation complexity.
- Reasoning tiering – altering inference behavior (direct generation → chain‑of‑thought → multi‑step reflection) to trade reasoning depth for cost.
- Capacity tiering – changing the model size (small → medium → large) to adjust raw capacity.
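The common three-tier interface described above can be sketched as a small abstraction. The sketch below is a hypothetical realization, not the paper's code: the `TieredModule` class, the `rule_based_entities` heuristic, and the tier-to-function mapping are all illustrative stand-ins (a real Mid tier might be a BERT tagger, a real High tier a full LLM call).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Tier(Enum):
    LOW = 0
    MID = 1
    HIGH = 2


@dataclass
class TieredModule:
    """A memory module exposing a Low/Mid/High budget interface.

    Implementation tiering is modeled by mapping each tier to a
    different extraction function of increasing cost.
    """
    name: str
    impls: Dict[Tier, Callable[[str], str]]

    def run(self, text: str, tier: Tier) -> str:
        return self.impls[tier](text)


def rule_based_entities(text: str) -> str:
    """Cheap Low-tier heuristic: capitalized tokens stand in for a real extractor."""
    return ", ".join(w for w in text.split() if w[:1].isupper())


# Hypothetical instantiation of the entity-extraction stage; the Mid
# and High slots reuse the cheap function purely as placeholders.
entity_module = TieredModule(
    name="entity",
    impls={
        Tier.LOW: rule_based_entities,
        Tier.MID: rule_based_entities,   # e.g., a BERT-based tagger in practice
        Tier.HIGH: rule_based_entities,  # e.g., a full-scale LLM call in practice
    },
)

print(entity_module.run("Alice met Bob in Paris", Tier.LOW))
```

A router then only needs to choose a `Tier` per module per query; the pipeline itself stays fixed.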
A shared lightweight router sits atop the pipeline. For each incoming query, the router observes the query and the intermediate states produced by earlier modules, then selects a tier for the next module. The router is trained with reinforcement learning (policy gradient) using a composite reward: a performance component (task accuracy, F1, etc.) and a cost component (FLOPs, latency, memory usage). By maximizing this reward, the router learns a policy that automatically balances quality against budget constraints on a per‑query basis.
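The router's training loop can be illustrated with a toy REINFORCE sketch. Everything below is an assumption for illustration: the linear-in-logits policy, the saturating toy accuracy model, the cost weight `lam`, and the running-mean baseline are not specified by the paper, which only describes a lightweight policy trained by policy gradient against a performance-minus-cost reward.

```python
import numpy as np

NUM_MODULES, NUM_TIERS = 5, 3  # filter + 3 extractors + summarizer; Low/Mid/High


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def composite_reward(accuracy, cost, lam=0.1):
    # Performance component minus a weighted cost penalty (hypothetical shaping).
    return accuracy - lam * cost


rng = np.random.default_rng(0)
# One row of tier logits per module; in the full system a small network
# over query/state features would produce these per query.
logits = np.zeros((NUM_MODULES, NUM_TIERS))
lr, baseline = 0.5, 0.0

for _ in range(200):
    probs = softmax(logits)
    tiers = np.array([rng.choice(NUM_TIERS, p=p) for p in probs])
    # Toy environment: accuracy grows with the total tier budget, then saturates,
    # so the optimal policy spends a moderate, not maximal, budget.
    budget = float(tiers.sum())
    accuracy = min(1.0, 0.5 + 0.08 * budget)
    r = composite_reward(accuracy, cost=budget, lam=0.03)
    advantage = r - baseline
    baseline += 0.1 * (r - baseline)
    # REINFORCE: grad of log pi(a) w.r.t. logits is one_hot(a) - probs, per module.
    grad = -probs
    grad[np.arange(NUM_MODULES), tiers] += 1.0
    logits += lr * advantage * grad
```

In the real system the reward's performance term would come from task metrics (accuracy, F1) and the cost term from measured FLOPs, latency, or memory, but the update rule has the same shape.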
The authors evaluate BudgetMem on three benchmarks: LoCoMo (dialogue‑centric memory tasks), LongMemEval (long‑document retrieval and summarization), and HotpotQA (multi‑document reasoning). They compare against strong baselines, including static‑budget pipelines, offline‑constructed memory systems, and single‑tier runtime approaches. Results show that in high‑budget settings BudgetMem surpasses the best baselines, achieving up to 3 % absolute gains on HotpotQA. More importantly, under tighter budgets BudgetMem consistently yields superior accuracy‑cost frontiers, reducing computational cost by 15‑30 % while maintaining comparable or better performance.
A detailed ablation reveals the relative strengths of the three tiering strategies. Implementation tiering shines when the budget is very low, providing fast, albeit coarse, filtering. Reasoning tiering becomes advantageous in mid‑to‑high budgets where deeper logical processing (e.g., chain‑of‑thought) yields noticeable accuracy improvements. Capacity tiering quantifies the impact of model size, showing that medium‑sized models often hit the sweet spot between cost and performance when large models are infeasible.
Analysis of the learned routing policy uncovers interpretable patterns: factual, short‑answer queries trigger low‑ or mid‑tier selections, while multi‑step reasoning queries activate high‑tier reasoning modules. The router also adapts to input length, applying low‑budget filtering on long histories before allocating higher budgets to downstream extraction.
The paper acknowledges limitations: the modular pipeline is fixed in the experiments, and extending BudgetMem to other domains (code generation, multimodal tasks) will require new module designs. The reinforcement‑learning stage can be sensitive to reward shaping, and real‑world deployment may need meta‑learning or policy transfer to reduce training overhead.
In summary, BudgetMem introduces a principled, modular, and learnable approach to runtime memory extraction for LLM agents. By exposing per‑module budget tiers and training a cost‑aware router, it enables fine‑grained, query‑specific control over the performance‑cost trade‑off, a capability that has been largely missing from prior memory‑augmented LLM systems. The extensive experiments and analyses demonstrate both the practical gains and the nuanced trade‑offs among implementation, reasoning, and capacity tiering strategies, paving the way for more efficient and adaptable LLM agents in production environments.