Learning to Evict from Key-Value Cache
The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token’s future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by a future-utility reward that evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM and no additional inference. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
💡 Research Summary
The paper tackles the memory bottleneck caused by the key‑value (KV) cache in autoregressive large language models (LLMs). While prior work relies on heuristics such as recency, attention scores, or compression techniques, these methods only provide indirect proxies for a token’s future importance and often add computational overhead. The authors reformulate KV cache eviction as a reinforcement‑learning (RL) ranking problem: predict the future utility of each cached token and sort tokens accordingly.
To this end, they introduce KV Policy (KVP), a framework that trains a lightweight RL agent for every attention head. Each agent receives only a token’s key vector, value vector, and position (no queries or future tokens) and outputs a scalar score via a small MLP. The scores define a Plackett‑Luce distribution over permutations; using Gumbel‑Sort, a full permutation can be sampled in a single parallel step, making training efficient.
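The scoring-and-sampling step above can be sketched in a few lines of numpy. The two-layer MLP shape and the simple concatenation of key, value, and position features are illustrative assumptions, not the paper's exact architecture; the Gumbel-Sort step, however, is the standard Gumbel-max construction for sampling a full permutation from a Plackett-Luce distribution in one parallel pass:

```python
import numpy as np

def score_tokens(keys, values, positions, W1, b1, w2):
    """Hypothetical per-head scorer: a tiny 2-layer MLP mapping
    [key ; value ; position] -> one scalar score per cached token.
    keys: (T, d_k), values: (T, d_v), positions: (T,)."""
    feats = np.concatenate([keys, values, positions[:, None]], axis=-1)
    hidden = np.maximum(feats @ W1 + b1, 0.0)  # ReLU hidden layer
    return hidden @ w2                          # (T,) scores

def gumbel_sort_sample(scores, rng):
    """Sample a permutation from the Plackett-Luce distribution whose
    log-weights are `scores`: perturb each score with i.i.d. Gumbel(0,1)
    noise and argsort descending -- a single parallel step instead of
    T sequential categorical draws."""
    gumbel = -np.log(-np.log(rng.uniform(size=scores.shape)))
    return np.argsort(-(scores + gumbel))
```

Sorting `scores + gumbel` descending is exactly equivalent to drawing tokens one at a time without replacement with probabilities proportional to `exp(scores)`, which is why a single vectorized sort suffices during training.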
The reward is defined offline: for a given permutation, the total reward is the sum over all possible cache budgets b of the “future attention loss” incurred when only the top‑b tokens are kept. Future attention for a token is measured as the sum of attention weights it receives from all subsequent tokens in the training trace. This reward can be computed from pre‑recorded attention matrices without any additional LLM forward passes.
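As described, the reward needs only a pre-recorded attention matrix. A minimal numpy sketch of that computation follows; the function names are mine, and the inner loop over budgets is written for clarity (it collapses to a closed form where each token's utility is weighted by its rank):

```python
import numpy as np

def future_attention(attn):
    """Per-token future utility from a pre-recorded (T, T) attention
    matrix, where attn[i, j] is the weight query i places on key j.
    Token j's utility is the total attention it receives from all
    strictly later tokens (i > j)."""
    T = attn.shape[0]
    later_mask = np.tril(np.ones((T, T)), k=-1)  # 1 where i > j
    return (attn * later_mask).sum(axis=0)

def permutation_reward(perm, utility):
    """Budget-agnostic reward for a ranking `perm` (best token first):
    sum over every cache budget b of the negative future-attention loss,
    i.e. the utility mass of the tokens evicted when only the top-b
    ranked tokens are kept. No extra LLM forward passes are needed."""
    T = len(perm)
    reward = 0.0
    for b in range(1, T + 1):
        evicted = perm[b:]                # tokens outside the top-b
        reward -= utility[evicted].sum()  # their lost future attention
    return reward
```

Because a token ranked at position r is evicted under exactly r of the T budgets, the loop is equivalent to `-sum(r * utility[perm[r]] for r in range(T))`, so the reward can be evaluated in linear time once utilities are computed.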
Experiments on two model families (e.g., LLaMA‑2 and Falcon) and on the long‑context benchmark RULER as well as the multi‑turn dialogue benchmark OASST2‑4k show that KVP consistently outperforms strong baselines such as FIFO, LRU, StreamingLLM, H2O, and other heuristic policies. The gains are especially pronounced under tight memory budgets, where KVP preserves the most informative tokens and limits degradation in perplexity and downstream accuracy.
Zero‑shot generalization tests on downstream tasks from the EleutherAI Evaluation Harness—including LongBench, BOOLQ, and ARC—demonstrate that policies learned on synthetic generation traces transfer well to unseen domains and even to longer context lengths than seen during training.
The paper’s contributions are: (1) reframing KV cache eviction as a budget‑agnostic ranking problem; (2) introducing per‑head lightweight RL agents that rely solely on cached keys and values; (3) proposing a global reward that evaluates a ranking across all cache sizes without extra inference; and (4) showing substantial empirical improvements and robust generalization. Limitations include the linear increase in agent parameters with the number of heads and potential over‑fitting to the distribution of training traces. Future work may explore parameter sharing across heads, integration with hierarchical memory off‑loading, and combination with higher‑level context summarization techniques.