Did You Check the Right Pocket? Cost-Sensitive Store Routing for Memory-Augmented Agents
Authors: Madhava Gaikwad
ICLR 2026 Workshop on Memory and State in LLM-Based Agents

Abstract

Memory-augmented agents maintain multiple specialized stores, yet most systems retrieve from all stores for every query, increasing cost and introducing irrelevant context. We formulate memory retrieval as a store-routing problem and evaluate it using coverage, exact match, and token efficiency metrics. On downstream question answering, an oracle router achieves higher accuracy while using substantially fewer context tokens compared to uniform retrieval, demonstrating that selective retrieval improves both efficiency and performance. Our results show that routing decisions are a first-class component of memory-augmented agent design and motivate learned routing mechanisms for scalable multi-store systems. We additionally formalize store selection as a cost-sensitive decision problem that trades answer accuracy against retrieval cost, providing a principled interpretation of routing policies.

1 Introduction

Consider a user asking an agent: "What was my weight before I started the diet?" This query depends on historical information from a previous chat session and therefore should draw from long-term memory. It does not require the current-session context, and it typically does not require raw transcripts. Yet many memory-augmented agent frameworks retrieve from multiple stores or large aggregated memory contexts regardless of query type, relying on the language model to filter relevant information during generation (Lewis et al., 2020; Liu et al., 2023a; Chase, 2023).

Uniform retrieval carries two costs. First, it creates computational waste by querying memory stores that cannot contain the answer.
Second, it can cause accuracy degradation: irrelevant or noisy retrieved context can hinder the model's ability to identify answer-bearing information, particularly in long-context settings where additional tokens reduce the signal-to-noise ratio (Liu et al., 2023b; Yu et al., 2024; Zhang et al., 2026).

We formalize store selection as a routing problem. Given a query, the router chooses which stores to search before retrieval. This decouples store selection from within-store ranking and makes the accuracy–cost tradeoff explicit.

Our evaluation separates two questions. First, do routing policies choose the stores that should be searched for a given query? We test this with synthetic routing labels derived from query taxonomies (Section 5.1). Second, if the router chooses the right stores, does this improve downstream QA accuracy on real LLMs? We test this with an oracle router and fixed store subsets (Section 5.2). This setup isolates the value of store selection from model capability and from within-store retrieval quality.

Contributions. (1) We define routing metrics (coverage, exact match, waste) that capture complementary aspects of store selection quality. (2) We introduce simple routing baselines, including a conservative hybrid heuristic that emphasizes coverage. (3) We evaluate routing policies using both synthetic labels and real LLM-based QA, showing that selective store choice can reduce tokens while improving accuracy, and that uniform retrieval can underperform despite using more context. (4) We formalize store routing as a cost-sensitive subset-selection problem, providing a decision-theoretic framework that explains when selective retrieval improves both efficiency and accuracy.

2 Related Work

Adaptive Retrieval. Prior work studies when to retrieve (Self-RAG, Asai et al. (2023); FLARE, Jiang et al. (2023)) and how much to retrieve (Adaptive-RAG, Jeong et al. (2024)).
Recent methods explore which retriever to use through learning-to-rank formulations (Kim & Diaz, 2025) or self-reflection mechanisms (Wu et al., 2026). SmartRAG (Gao et al., 2025) uses reinforcement learning for joint retrieval and generation optimization. We consider a complementary question: where retrieval should occur when memory is partitioned across heterogeneous stores with distinct semantic roles (e.g., episodic vs. semantic memory).

Memory Systems. MemGPT (Packer et al., 2024) implements hierarchical memory with explicit management operations. Generative Agents (Park et al., 2023) rely on reflection and importance scoring to consolidate information over time. Recent architectures organize memory hierarchically for multi-agent reasoning (Zhang et al., 2025). These systems focus primarily on memory organization; we instead focus on memory access, specifically cost-sensitive routing across stores prior to retrieval. Our store taxonomy draws on cognitive distinctions between episodic and semantic memory (Tulving et al., 1972) and translates these distinctions into operational routing decisions.

Multi-Index and Federated Retrieval. The information retrieval literature has long examined federated search across heterogeneous collections (Callan, 2002; Si & Callan, 2003; Shokouhi et al., 2011), where resource selection algorithms estimate relevance distributions for each collection (Lu & Callan, 2006). These methods route queries across independent search engines; in our setting, the same principles apply to agent memory stores that differ semantically (e.g., current dialogue vs. historical summaries) rather than by document source.

Retrieval Routing and Selection. Recent RAG work shows that routing queries across specialized retrievers can improve both efficiency and accuracy (Kim & Diaz, 2025; Wu et al., 2026; Guo et al., 2025). ExpertRAG (Gumaan, 2025) applies mixture-of-experts routing to context selection, while RAP-RAG Ji et al.
(2025) plans adaptive retrieval sequences for multi-hop reasoning. Unlike retriever routing, our setting routes across persistent memory stores whose contents reflect distinct temporal or semantic roles (e.g., STM vs. LTM), requiring store-level rather than passage-level decisions.

Context Noise in Long Documents. Long-context modeling faces signal-to-noise challenges when irrelevant tokens distract attention (Li et al., 2024b). Recent analyses show that long-context LLMs often underperform targeted retrieval despite larger windows (Li et al., 2024a), and that retrieval decisions themselves benefit from uncertainty-guided policies (Chen et al., 2026). Our work complements these findings by showing that store-level noise, retrieving information from irrelevant memory types, can further degrade performance, especially when each store contributes hundreds of tokens.

Memory Benchmarks. LoCoMo (Maharana et al., 2024) evaluates multi-session conversational memory. LongMemEval (Wu et al., 2025) benchmarks knowledge updates and temporal ordering. We use their question taxonomies to derive store-labeling protocols.

3 Problem Formulation

3.1 Memory Architecture

A memory store is an independent index containing semantically related information. Each store can be queried separately, and the retrieved content is concatenated into the LLM context. Following MemGPT, we consider four stores:

Short-Term Memory (STM) holds the current conversation, typically the last N turns. Queries such as "what did I just mention" or "today's meeting" require STM.

Summary Store contains compressed user facts, including preferences, biographical details, and ongoing projects. Queries such as "what is my phone number" target this store.

Long-Term Memory (LTM) stores summaries of past conversations. Queries about earlier discussions ("what did we talk about last week") rely on LTM.

Episodic Memory preserves raw transcripts.
Queries that require exact wording or precise timestamps may need episodic memory.

3.2 The Routing Problem

Let S = {STM, Sum, LTM, Epi} denote the set of stores, where Sum is the Summary store and Epi is episodic memory. Given a query q, a routing policy π selects stores Ĝ = π(q) ⊆ S. The system retrieves content from the selected stores and prompts the LLM. Let G denote the ground-truth stores, the stores containing the information needed to answer q. In our synthetic routing evaluation, we derive G from query type: a "single-hop fact" query has G = {Sum}; a "temporal comparison" query has G = {LTM, Epi}. See Section 9 for the full mapping.

3.3 Evaluation Metrics

We evaluate routing with three metrics.

Coverage measures whether all necessary stores were included:

    \mathrm{Coverage} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[ G_i \subseteq \hat{G}_i \right]    (1)

Under our evaluation protocol (full-store concatenation), a coverage failure means the answer is not retrievable from the provided context.

Exact Match (EM) measures whether the policy selects precisely the required stores:

    \mathrm{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[ G_i = \hat{G}_i \right]    (2)

High EM corresponds to efficient routing without over-retrieval.

Waste counts unnecessary stores retrieved:

    \mathrm{Waste} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{G}_i \setminus G_i \right|    (3)

Waste is a store-level proxy for token cost because each additional store contributes retrieved tokens, and it can also reduce accuracy by introducing contextual noise.

Cost. We measure cost as context tokens, the number of tokens inserted into the prompt (counted via tiktoken). This serves as a direct proxy for inference cost.¹

3.4 Cost-Sensitive Store Routing: A Decision Framework

Store selection can be viewed as a cost-sensitive subset-selection problem. Let S denote the set of available memory stores, and let c_s represent the retrieval cost associated with store s ∈ S (e.g., context tokens or infrastructure access cost).
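As a concrete illustration, the three metrics in Equations (1)-(3) can be computed per query set as follows (a minimal sketch; the `routing_metrics` helper and the example queries are ours, not part of the paper's released code):

```python
# Sketch of the routing metrics in Section 3.3 (Equations 1-3).
# Store names follow Section 3.1; the example queries are hypothetical.

def routing_metrics(gold, predicted):
    """gold, predicted: parallel lists of sets of store names, one pair per query."""
    n = len(gold)
    coverage = sum(g <= p for g, p in zip(gold, predicted)) / n      # 1[G_i ⊆ Ĝ_i]
    exact    = sum(g == p for g, p in zip(gold, predicted)) / n      # 1[G_i = Ĝ_i]
    waste    = sum(len(p - g) for g, p in zip(gold, predicted)) / n  # |Ĝ_i \ G_i|
    return coverage, exact, waste

# A uniform policy always predicts all four stores: perfect coverage, high waste.
ALL = {"STM", "Sum", "LTM", "Epi"}
gold = [{"Sum"}, {"LTM", "Epi"}]
uniform = [ALL, ALL]
print(routing_metrics(gold, uniform))  # (1.0, 0.0, 2.5)
```

This mirrors the metric definitions directly: set inclusion for coverage, set equality for exact match, and set difference for waste.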
Given a query q, a routing policy selects a subset of stores G ⊆ S before retrieval. The objective is to balance answer quality against retrieval cost. Let Acc(q, G) denote the expected probability that the downstream LLM produces a correct answer when stores G are retrieved. A cost-sensitive routing policy can therefore be defined as

    \pi^{*}(q) = \arg\max_{G \subseteq S} \left[ \mathbb{E}\,\mathrm{Acc}(q, G) - \lambda \sum_{s \in G} c_s \right],    (4)

where λ controls the tradeoff between accuracy and retrieval cost.

This formulation highlights several useful interpretations. Uniform retrieval corresponds to the special case λ = 0, where all stores are retrieved regardless of cost. Oracle routing approximates the optimal solution when the relevant stores for each query are known. Heuristic routing policies attempt to approximate π* using semantic signals extracted from the query.

Importantly, store routing differs from retriever routing. Retriever routing selects which search system or index to query, typically assuming a homogeneous document collection. Store routing operates at the memory-architecture level, where each store contains information with a distinct semantic role (e.g., short-term context, persistent user facts, or historical sessions). The routing decision therefore determines not only which documents are retrieved but also the effective signal-to-noise ratio of the context presented to the language model.

This perspective also clarifies the empirical findings in our evaluation. When irrelevant stores are retrieved, the effective retrieval cost increases while the probability of extracting the correct evidence can decrease due to contextual noise. Conversely, selecting a smaller but semantically appropriate subset of stores improves both efficiency and answer reliability.
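With only four stores, Equation (4) can be realized by brute-force search over the 2^|S| subsets. The sketch below assumes a hypothetical accuracy estimator (`toy_acc`) and an illustrative λ; the per-store costs are borrowed from the Section 11 analysis:

```python
# Sketch of the cost-sensitive objective in Equation (4): enumerate all
# subsets of the four stores and pick the one maximizing Acc - lambda * cost.
from itertools import combinations

STORES = ("STM", "Sum", "LTM", "Epi")
COST = {"STM": 1, "Sum": 1, "LTM": 3, "Epi": 5}  # relative costs from Section 11

def route(acc_estimate, lam):
    """acc_estimate: maps a frozenset of stores to an estimated P(correct answer)."""
    best, best_utility = frozenset(), float("-inf")
    for r in range(len(STORES) + 1):
        for subset in combinations(STORES, r):
            g = frozenset(subset)
            utility = acc_estimate(g) - lam * sum(COST[s] for s in g)
            if utility > best_utility:
                best, best_utility = g, utility
    return best

# Hypothetical estimator: the answer lives in the Summary store; extra stores add noise.
def toy_acc(g):
    return 0.9 - 0.05 * (len(g) - 1) if "Sum" in g else 0.1

print(route(toy_acc, lam=0.01))  # frozenset({'Sum'})
```

Note how the λ knob recovers the policies discussed above: at λ = 0 only the accuracy term matters, while a large λ drives the router toward retrieving nothing at all.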
The oracle–heuristic gap observed in Section 6 suggests that future systems should learn routing policies that optimize downstream QA objectives directly, rather than relying solely on rule-based heuristics.

¹ We also considered store-access latency. Results are similar; see Section 11.

4 Method

We compare policies that span a simple-to-strong spectrum. The key difference lies in what information the router uses and how conservative it is about missing required stores.

Uniform Baseline. Always retrieve from all stores. This guarantees perfect coverage but results in the highest cost and waste. Many deployed systems use this default behavior.

Oracle Upper Bound. Use ground-truth store labels. This achieves perfect coverage and EM and serves as a cost-efficient upper bound under our labeling protocol. The oracle is not deployable, but it quantifies the headroom available from store selection alone.

Fixed Subset Policies. Retrieve from a fixed subset such as STM+Sum+LTM. These policies are deployable and provide a strong baseline when adaptive routing is unavailable.

Hybrid Heuristic (baseline). We also test a simple deployable heuristic that combines semantic pattern matching with a conservative fallback. Algorithm 1 shows the core rule-based logic. In practice, we also use query-store embedding similarity as a tiebreaker when no pattern fires, which contributes an additional 4% coverage (see ablation in Section 6). We present the hybrid as a baseline rather than a final router, because its downstream QA performance leaves substantial room for learned routing.
Algorithm 1 Hybrid Store Routing (core rules; embedding similarity used as tiebreaker when no pattern matches)
 1: Input: Query q
 2: Extract semantic signals from q
 3: if quantity signal ("list all", "every") then
 4:   return {LTM, Epi}  {Exhaustive recall}
 5: else if temporal signal ("before", "changed") then
 6:   return {LTM, Epi}  {Historical comparison}
 7: else if multi-hop signal ("compare", "relate") then
 8:   return {Sum, LTM}  {Cross-reference}
 9: else if current session ("just said", "today") then
10:   return {STM}  {Recent context}
11: else if fact lookup ("what is my", "who is my") then
12:   return {Sum}  {User profile}
13: else
14:   return {Sum, LTM}  {Safe fallback}
15: end if

Fallback Choice. We tested all six two-store combinations. Sum+LTM yields the highest coverage (89%), making it the safest default.

Design Rationale. We optimize first for coverage, since missing a required store makes the question effectively unanswerable. When semantic signals are clear, we route narrowly; when they are ambiguous, we fall back to the combination that recovers the most cases.

5 Experiments

5.1 Synthetic Routing Evaluation

Before testing routing policies on LLM-based question answering, we first verify whether the policies select the appropriate memory stores under controlled conditions using synthetic labels. This preliminary step allows us to isolate the routing decision itself, independent of retrieval quality or model reasoning, and ensures that downstream performance differences can be interpreted more clearly.

Store-Labeling Protocol. We derive ground-truth store labels from the query taxonomies used in LoCoMo and LongMemEval. Each query type is associated with the memory stores that contain the information required to answer it. For example, single-hop fact queries typically depend on the Summary store, while temporal or comparison queries often require information from both Long-Term Memory and Episodic Memory.
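The rule-based core of Algorithm 1 can be sketched in Python as follows (a minimal illustration: the trigger phrases are only the examples given in the algorithm listing, real rule sets would be broader, and the embedding-similarity tiebreaker is omitted):

```python
# Sketch of Algorithm 1's core rules. Trigger phrases are the examples
# from the listing; the embedding tiebreaker used when no pattern fires
# is omitted, so unmatched queries fall through to the safe default.

RULES = [
    (("list all", "every"),        {"LTM", "Epi"}),  # exhaustive recall
    (("before", "changed"),        {"LTM", "Epi"}),  # historical comparison
    (("compare", "relate"),        {"Sum", "LTM"}),  # cross-reference
    (("just said", "today"),       {"STM"}),         # recent context
    (("what is my", "who is my"),  {"Sum"}),         # user-profile fact lookup
]
FALLBACK = {"Sum", "LTM"}  # highest-coverage two-store default (Section 4)

def hybrid_route(query: str) -> set:
    q = query.lower()
    for triggers, stores in RULES:
        if any(t in q for t in triggers):
            return stores
    return FALLBACK

print(hybrid_route("What was my weight before I started the diet?"))  # routes to LTM and Epi
print(hybrid_route("What is my phone number?"))                       # routes to Sum
```

The rule order matters: broader historical signals are checked before narrow fact-lookup patterns, which is what makes the heuristic conservative about missing required stores.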
The complete mapping of query categories to store requirements is provided in Section 9. These labels are generated automatically using semantic rules derived from the benchmark taxonomies rather than manual annotation. As a result, the labeling process remains scalable and reproducible across datasets. Because the labels are based on query semantics rather than observed retrieval outcomes, they provide a consistent reference for evaluating routing decisions even when retrieval pipelines or underlying models change.

Dataset. 1,000 synthetic queries across 7 types, 70/30 train/test split.

Table 1: Synthetic routing evaluation. Metrics measure store selection quality (not QA accuracy).

Policy                         Coverage   Exact Match   Waste
Uniform                        100%       8%            2.9
Rule-based (linguistic only)   57%        35%           0.5
Hybrid (ours)                  94%        58%           1.2
Oracle                         100%       100%          0.0

Findings. The uniform policy achieves perfect coverage because it always retrieves from every store. However, its exact match rate is only 8%, reflecting substantial over-retrieval. Since all stores are included regardless of query requirements, the policy rarely selects the minimal set of stores needed to answer a question.

The linguistic rule-based baseline performs better in terms of precision but suffers from limited coverage, reaching only 57%. Its performance declines on query types that lack explicit surface cues, such as temporal comparisons or multi-hop reasoning tasks where the required stores cannot be inferred from simple keyword patterns.

The hybrid heuristic improves coverage substantially, reaching 94%, while maintaining a moderate exact match rate of 58%. This gain is primarily driven by combining semantic pattern detection with a conservative fallback strategy. When the heuristic detects clear signals, it routes narrowly; when signals are weak or ambiguous, it selects a broader but still constrained set of stores, which helps recover otherwise missed cases.

5.2 LLM Evaluation

Setup.
We evaluate 12 store-selection policies on a dataset of 150 questions using GPT-3.5-turbo and GPT-4o-mini. The questions span seven query types and are tested under two context regimes. The short condition includes 100 questions with approximately 200 tokens retrieved per store, while the long condition includes 50 questions with approximately 1000 tokens retrieved per store. This design allows us to study both moderate-context and high-context retrieval settings. All evaluations use temperature 0 to reduce generation variance and substring-based answer matching for scoring.²

Table 2: LLM evaluation (150 questions). Accuracy = QA correctness. Tokens = context size.

Model         Policy         Overall   Short   Long   Tokens
GPT-3.5       oracle         85.3%     93%     70%    299
GPT-3.5       stm+sum+ltm    85.3%     92%     72%    591
GPT-3.5       uniform        83.3%     91%     68%    787
GPT-3.5       hybrid         69.3%     79%     50%    379
GPT-4o-mini   oracle         86.7%     94%     72%    299
GPT-4o-mini   stm+sum+ltm    84.7%     92%     70%    591
GPT-4o-mini   uniform        81.3%     92%     60%    787
GPT-4o-mini   hybrid         70.7%     80%     52%    379

² Accuracy differences greater than 4% are statistically significant at p < 0.05 via bootstrap resampling (1,000 iterations).

Finding 1: Store selection can improve accuracy and reduce tokens. Oracle routing outperforms uniform retrieval on both efficiency and answer quality. It achieves higher accuracy (86.7% vs 81.3%) while using 62% fewer context tokens (299 vs 787). This result highlights that providing more context does not necessarily improve performance. When the router selects only the stores that are likely to contain the answer, the model receives a smaller but cleaner context, which can lead to more reliable extraction.

Finding 2: Long context amplifies the penalty of over-retrieval. The difference between routing policies becomes more pronounced in the long-context setting. On long-context questions, oracle routing reaches 72% accuracy compared with 60% for uniform retrieval.
When each store contributes roughly 1000 tokens, adding irrelevant stores significantly increases the amount of distracting text, making it harder for the model to identify the correct information.

Finding 3: A strong fixed policy is competitive. A fixed routing policy such as STM+Sum+LTM approaches oracle-level accuracy while maintaining moderate cost. This suggests that even simple routing strategies can capture much of the benefit of selective retrieval when the memory architecture is well structured. Episodic retrieval, in contrast, is required only in a small subset of cases and can sometimes degrade performance by introducing unnecessary context.

Finding 4: Heuristic routing is not yet sufficient. Although the hybrid heuristic achieves strong coverage on the synthetic routing benchmark, it performs less well on downstream QA tasks. This gap indicates that selecting the correct stores in principle does not always translate into better end-to-end performance. Downstream accuracy depends not only on store selection but also on the model's ability to retrieve and use the relevant evidence within the provided context.

5.3 Why Does Uniform Underperform?

Uniform retrieval provides more information than oracle routing, yet it consistently yields lower accuracy. Two factors help explain this behavior.

Needle in a haystack. With approximately 787 tokens in the prompt, the model must locate a small set of relevant facts embedded within a larger amount of unrelated text. This difficulty becomes more pronounced in long-context queries, where each additional store can contribute roughly 1000 tokens, further diluting the signal.

Conflicting information. Different stores may contain inconsistent or outdated facts. For example, long-term memory may preserve earlier information that conflicts with updated user summaries.
When all stores are retrieved together, the model must decide which source to trust, and it may occasionally select the wrong one.

Example. Consider the query "Who is my current manager?" The Summary store contains the entry "Manager: Jennifer Williams." The LTM store contains a previous session note: "Before the reorg, user reported to Michael Torres. Now reports to Jennifer Williams." Under uniform retrieval, both passages appear in context, and GPT-4o-mini sometimes extracts "Michael Torres" from the more detailed historical passage. Under oracle routing, which retrieves only the Summary store, the model consistently returns "Jennifer Williams." This example illustrates how additional context can sometimes mislead the model rather than improve performance.

5.4 Understanding the Coverage-Accuracy Gap

The hybrid heuristic achieves 94% routing coverage (Section 5.1) but only 70% QA accuracy. Several factors contribute to this difference.

Coverage is not accuracy. Coverage measures whether the router included the stores that contain the required information. QA accuracy measures whether the model successfully identifies and uses that information in the retrieved context. Even when the correct store is present, extraction can fail if the context is long or contains distracting material.

Missing stores remain decisive. When the hybrid heuristic fails to select a required store, which occurs in 6% of queries, the question effectively becomes unanswerable under the retrieval setup. These cases directly contribute to the overall error rate.

Over-retrieval can also hurt. The hybrid policy achieves 58% exact match, meaning that 42% of queries retrieve additional stores beyond those required. The resulting extra context can distract the model and reduce answer accuracy even when the correct stores are included.

Overall, the coverage-accuracy gap reflects two distinct failure modes.
The first is routing error, where the heuristic misses a necessary store (12% of QA errors). The second is extraction error, where the model fails to locate or correctly use the answer despite correct store selection (18% of QA errors), often because over-retrieval introduces additional contextual noise.

6 Analysis

Our results highlight two distinct phenomena. First, selecting the appropriate stores before retrieval can simultaneously reduce context tokens and improve downstream QA accuracy. Second, simple heuristic routing, although effective, does not fully close the performance gap relative to oracle routing. To better understand which signals drive routing quality, we analyze feature contributions and computational overhead.

6.1 Feature Ablation

We evaluate different feature groups on the synthetic routing benchmark to understand which signals are most useful for identifying the correct stores.

Table 3: Feature ablation: routing coverage by feature type.

Features                                     Coverage   ∆
Linguistic (pronouns, tense)                 57%        baseline
+ Semantic (quantity, temporal, multi-hop)   90%        +33%
+ Embedding similarity                       94%        +4%

Semantic signals dominate. Adding semantic indicators such as quantity ("list all"), temporal references ("before"), and multi-hop reasoning cues ("compare") increases coverage by 33 percentage points. These signals capture query patterns that simple linguistic features, such as pronouns or tense, often fail to identify. As a result, routing policies that rely only on shallow linguistic patterns tend to miss a substantial fraction of queries requiring historical or multi-store reasoning.

Embedding similarity provides an additional 4 percentage point improvement. Its main benefit appears on queries that do not contain clear surface cues but still have a semantic relationship to specific stores. In these cases, similarity scores help guide the router toward likely stores even when explicit rule-based triggers are absent.

6.2 Computational Overhead

Routing introduces minimal additional latency. Rule-based routing policies add less than 1 ms of processing time, while embedding-based signals add approximately 5 ms. Both overheads are small relative to typical LLM inference times, which range from 500 to 2000 ms. Because routing reduces the amount of retrieved context, the resulting 62% token reduction directly translates into lower inference cost. In practice, this reduction also decreases prompt processing time and can improve system responsiveness without requiring any modification to the underlying language model.

7 Limitations

Synthetic labels. Ground-truth store labels are derived from query taxonomies rather than human annotation. This protocol allows controlled and reproducible evaluation of routing behavior, but it does not fully capture the variability present in real-world deployments. In practice, the necessity of a store may depend on how information is written, summarized, or updated over time. Human validation of store requirements would therefore provide a stronger assessment of routing accuracy in production settings.

Heuristic router gap. The hybrid heuristic achieves high coverage on the synthetic routing benchmark but underperforms on downstream QA compared with oracle routing and strong fixed policies. This difference suggests that routing decisions interact closely with within-store retrieval quality and answer extraction. Closing the gap will likely require learned routing approaches that jointly optimize store selection and evidence retrieval rather than relying solely on rule-based heuristics.

Two model families. Our evaluation includes GPT-3.5-turbo and GPT-4o-mini. While these models represent commonly used systems, testing additional architectures and training paradigms would strengthen claims about generalization.
In particular, models with different context handling strategies or retrieval sensitivities may respond differently to routing policies.

Full store retrieval. We concatenate full store contents instead of performing top-k retrieval within each store. This design simplifies the analysis by isolating store-selection effects, but production systems often apply ranking or filtering within each memory store. The interaction between routing decisions and within-store retrieval strategies remains an important direction for future study.

8 Conclusion

We formalized memory store selection as a routing problem and evaluated it in two stages: first using synthetic labels to validate store-selection behavior, and then using real LLMs to measure downstream question answering performance. This two-stage evaluation separates the quality of routing decisions from the effects of retrieval and generation, allowing us to analyze the role of store selection more directly.

Our results show several consistent patterns. Selective store choice can improve both efficiency and accuracy, reducing context tokens by 62% while increasing QA accuracy (86% vs 81%). Uniform retrieval, in contrast, can introduce unnecessary context that reduces performance, particularly in long-context settings where additional stores contribute large amounts of irrelevant text. Semantic signals substantially improve routing quality, increasing coverage by 33 percentage points compared with linguistic features alone. A simple fixed policy such as STM+Sum+LTM also provides a strong and deployable default when adaptive routing is not available.

We additionally introduced a decision-theoretic formulation of store routing that treats memory access as a cost-sensitive optimization problem.
This framework explains the observed accuracy–efficiency tradeoffs and clarifies why routing decisions can influence answer quality even when retrieval and language models remain unchanged.

The remaining 16-point gap between heuristic routing (70%) and oracle routing (86%) indicates that further gains are possible. Closing this gap will likely require routing policies that are learned end-to-end and optimized directly for downstream QA outcomes rather than relying solely on hand-designed heuristics. We hope this work encourages further research on learned routing methods and highlights the importance of store selection as a core component of memory-augmented agent systems.

9 Query Type to Store Mapping

To evaluate routing behavior under controlled conditions, we assign each query type a set of stores that contain the information required to answer it. The mapping reflects how information is distributed across the memory architecture rather than how any particular retrieval system performs. For example, factual user attributes are stored in the Summary store, while historical comparisons typically require information from both Long-Term Memory and Episodic Memory.

The goal of this mapping is to provide a consistent reference for measuring routing quality across policies. Because the mapping is derived from benchmark query taxonomies, it remains reproducible across datasets and does not depend on model outputs or manual labeling decisions. Table 4 lists the resulting query-type to store assignments used throughout our synthetic routing evaluation.

Table 4: Ground-truth store labels by query type.
Query Type          Stores      Rationale
single_hop          Sum         Simple fact lookup
single_session      STM         References current conversation
recent_session      LTM         References past conversations
multi_hop           Sum + LTM   Combines facts with history
memory_capacity     LTM + Epi   Exhaustive recall ("list all")
temporal            LTM + Epi   Historical comparison ("before X")
knowledge_update    Sum + LTM   Current vs historical state

10 Full Policy Comparison

Table 5: All 12 policies (GPT-4o-mini, 150 questions).

Policy         Accuracy   Short   Long   Tokens
oracle         86.7%      94%     72%    299
stm+sum+ltm    84.7%      92%     70%    591
uniform        81.3%      92%     60%    787
summary+ltm    74.0%      82%     58%    406
hybrid         70.7%      80%     52%    379
ltm            49.3%      58%     32%    212
ltm+episodic   49.3%      57%     34%    408
stm+summary    45.3%      48%     40%    379
summary        30.7%      36%     20%    195
episodic       30.7%      43%     6%     196
stm            14.7%      14%     16%    184
none           14.0%      2%      38%    0

11 Store-Access Cost Analysis

In addition to token cost, we examined the relative infrastructure cost associated with accessing different memory stores. In practical deployments, stores may reside on different storage tiers or require different retrieval pipelines, which can lead to varying access latency and compute overhead. To approximate these differences, we assign relative store-access costs: STM=1, Summary=1, LTM=3, and Episodic=5. These values represent normalized relative effort rather than exact system measurements.

Under this model, uniform retrieval incurs the highest cost (10), since it accesses all stores for every query. Oracle routing averages 3.9, reflecting its ability to select only the stores required for a given question. The hybrid heuristic reduces the average store-access cost by 29% while maintaining 94% coverage, demonstrating that routing policies can meaningfully reduce system overhead even when implemented with simple decision rules.
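Combining the Table 4 store assignments with these per-store costs gives a simple cost model. The sketch below assumes a uniform distribution over query types, so its oracle average (about 4.14) differs from the 3.9 reported above, which reflects the benchmark's actual query mix:

```python
# Sketch of the Section 11 cost model: per-store access costs applied to
# the oracle's store sets from Table 4. The query-type mix is assumed
# uniform here, which is why the average differs from the reported 3.9.

COST = {"STM": 1, "Sum": 1, "LTM": 3, "Epi": 5}

ORACLE_STORES = {  # Table 4: query type -> required stores
    "single_hop":       {"Sum"},
    "single_session":   {"STM"},
    "recent_session":   {"LTM"},
    "multi_hop":        {"Sum", "LTM"},
    "memory_capacity":  {"LTM", "Epi"},
    "temporal":         {"LTM", "Epi"},
    "knowledge_update": {"Sum", "LTM"},
}

def access_cost(stores):
    return sum(COST[s] for s in stores)

uniform_cost = access_cost(COST)  # all four stores
per_type = {t: access_cost(g) for t, g in ORACLE_STORES.items()}
mean_oracle = sum(per_type.values()) / len(per_type)

print(uniform_cost)           # 10
print(round(mean_oracle, 2))  # 4.14
```

The same helper extends naturally to any routing policy: average `access_cost` over the stores the policy selects per query to compare infrastructure overhead across policies.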
Although the main paper focuses on token cost because it directly influences LLM inference expense, the store-access analysis shows that similar efficiency gains appear at the infrastructure level. As routing policies become more adaptive, reductions in store-access overhead may translate into lower latency and improved scalability in production memory systems.

REFERENCES

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection, 2023.

Jamie Callan. Distributed information retrieval. In W. Bruce Croft (ed.), Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval, pp. 127–150. Springer, 2002.

Harrison Chase. LangChain: Building applications with LLMs through composability, 2023. https://www.langchain.com.

Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, and Jizhou Huang. Decide then retrieve: A training-free framework with uncertainty-guided triggering and dual-path retrieval, 2026. URL https://arxiv.org/abs/2601.03908.

Jingsheng Gao, Linxu Li, Weiyuan Li, Yuzhuo Fu, and Bin Dai. SmartRAG: Jointly learn RAG-related tasks from the environment feedback, 2025.

Esmail Gumaan. ExpertRAG: Efficient RAG with mixture of experts – optimizing context retrieval for adaptive LLM responses, 2025.

Yucan Guo, Miao Su, Saiping Guan, Zihao Sun, Xiaolong Jin, Jiafeng Guo, and Xueqi Cheng. RouteRAG: Efficient retrieval-augmented generation from text and graph via reinforcement learning, 2025. URL https://arxiv.org/abs/2512.09487.

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity, 2024. URL https://arxiv.org/abs/2403.14403.

Xu Ji, Luo Xu, Landi Gu, Junjie Ma, Zichao Zhang, and Wei Jiang. RAP-RAG: A retrieval-augmented generation framework with adaptive retrieval task planning.
Electronics, 14(21):4269, 2025.

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7969–7992, 2023.

To Eun Kim and Fernando Diaz. LTRR: Learning to rank retrievers for LLMs, 2025. URL https://arxiv.org/abs/2506.13743.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. Retrieval augmented generation or long-context LLMs? A comprehensive study and hybrid approach, 2024a. URL https://arxiv.org/abs/2407.16833.

Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, and Ngai Wong. UncertaintyRAG: Span-level uncertainty enhanced long-context modeling for retrieval-augmented generation, 2024b.

Jerry Liu et al. LlamaIndex: A data framework for LLM applications, 2023a. https://www.llamaindex.ai.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023b.

Jie Lu and Jamie Callan. Full-text federated search of text-based digital libraries in peer-to-peer networks. Information Retrieval, 9(4):477–498, 2006.

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753, 2024.

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez.
MemGPT: Towards LLMs as operating systems, 2024.

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22, 2023.

Milad Shokouhi, Luo Si, et al. Federated search. Foundations and Trends® in Information Retrieval, 5(1):1–102, 2011.

Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 298–305, 2003.

Endel Tulving et al. Episodic and semantic memory. Organization of Memory, 1(381-403):1, 1972.

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory, 2025.

Di Wu, Jia-Chen Gu, Kai-Wei Chang, and Nanyun Peng. Self-routing RAG: Binding selective retrieval with knowledge verbalization, 2026.

Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. RankRAG: Unifying context ranking with retrieval-augmented generation in LLMs. Advances in Neural Information Processing Systems, 37:121156–121184, 2024.

Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, and Shuicheng Yan. G-Memory: Tracing hierarchical memory for multi-agent systems. arXiv preprint, 2025.

Qianchi Zhang, Hainan Zhang, Liang Pang, Yongxin Tong, Hongwei Zheng, and Zhiming Zheng. Less is more: Compact clue selection for efficient retrieval-augmented generation reasoning, 2026. URL https://arxiv.org/abs/2502.11811.
A MEMORY CONTENT EXAMPLES

To illustrate how information is distributed across stores, we present representative examples drawn from the synthetic memory construction used in our experiments. These examples demonstrate the semantic roles of each store rather than the exact evaluation instances.

Short-Term Memory (STM). STM contains recent conversational context, such as scheduling or clarification requests:
"User just mentioned they have a meeting at 3pm today with the marketing team."
"User asked to schedule a follow-up call for tomorrow morning."
Queries such as "What did I just say about today's meeting?" require STM.

Summary Store. The Summary store contains stable user facts and preferences:
"User Profile: Name is Alex Chen. Works at TechCorp as Senior Software Engineer."
"Contact: Phone number is 555-867-5309. Email is alex.chen@techcorp.com."
"Manager: Jennifer Williams."
Fact lookup queries such as "Who is my manager?" primarily depend on this store.

Long-Term Memory (LTM). LTM stores summaries of past conversations:
"Session 9 (yesterday): Before the recent reorg, user's manager was Michael Torres. Now reports to Jennifer Williams."
"Session 10 (last week): User mentioned weight was 185 lbs before starting a diet."
Queries involving historical comparisons or past discussions typically require LTM.

Episodic Memory. Episodic memory preserves raw conversation turns:
"Session 10, Turn 2: User said 'Back in January I was 185 pounds. With the diet I started, I'm hoping to get down to 170 by summer.'"
Queries requiring exact wording or timestamped references may depend on episodic retrieval.
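The store contents above can be laid out as a minimal in-memory structure to show what selective retrieval returns. This is a sketch under our assumptions: the dict-of-lists schema and the `retrieve` helper are illustrative and are not the representation used in the released code; the entries themselves are the examples from the text.

```python
# Illustrative layout of the four memory stores from Appendix A.
# Schema (store name -> list of text entries) is an assumption for exposition.
MEMORY = {
    "stm": [
        "User just mentioned they have a meeting at 3pm today with the marketing team.",
        "User asked to schedule a follow-up call for tomorrow morning.",
    ],
    "summary": [
        "User Profile: Name is Alex Chen. Works at TechCorp as Senior Software Engineer.",
        "Manager: Jennifer Williams.",
    ],
    "ltm": [
        "Session 9 (yesterday): Before the recent reorg, user's manager was "
        "Michael Torres. Now reports to Jennifer Williams.",
        "Session 10 (last week): User mentioned weight was 185 lbs before starting a diet.",
    ],
    "episodic": [
        "Session 10, Turn 2: User said 'Back in January I was 185 pounds. "
        "With the diet I started, I'm hoping to get down to 170 by summer.'",
    ],
}

def retrieve(stores):
    """Return entries from only the selected stores (selective retrieval)."""
    return [entry for s in stores for entry in MEMORY[s]]
```

For the running example "What was my weight before I started the diet?", a router that selects `["ltm", "episodic"]` surfaces the two weight-related entries without pulling in the unrelated STM and Summary context.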
B CODE AVAILABILITY

All code used to generate synthetic routing labels, run routing-policy evaluations, and reproduce the LLM-based QA experiments is available in the following repository: https://github.com/krimler/memroute

The repository includes scripts for synthetic memory generation, routing-policy evaluation, ablation studies, and end-to-end QA benchmarking. Instructions for reproducing the experimental results are provided in the project README. All experiments can be executed using the configuration files included in the repository.