ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but they often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generation. We first design the Golden Eviction algorithm, which identifies the optimal KV pairs to evict at each step using future attention scores. These eviction traces and per-step scores are then distilled via supervised training with a pairwise ranking loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language-modeling loss increase on low-entropy tokens. Experiments with three reasoning models on the AIME2024 and AIME2025 benchmarks demonstrate that ForesightKV consistently outperforms prior methods at only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning.


💡 Research Summary

ForesightKV tackles the growing memory and compute burden caused by the linear expansion of the key‑value (KV) cache during long‑context generation in reasoning‑oriented large language models (LLMs). The core idea is to learn a lightweight scoring model that predicts the long‑term contribution of each KV pair, enabling dynamic eviction decisions that preserve essential information while respecting a strict memory budget.

The authors first introduce the “Golden Eviction” algorithm, which constructs optimal eviction traces by looking ahead at future attention scores. For a full reasoning trace, the attention matrix is partitioned into fixed‑length blocks along the query dimension. Block‑wise scores are obtained by average pooling across heads, and for each KV pair the maximum score among all future blocks is taken as its “future score”. At each eviction step, the KV pairs with the lowest future scores are marked for removal, guaranteeing minimal impact on subsequent attention computations. These future scores serve as supervision labels.
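The block-partitioning and future-score computation described above can be sketched in plain Python. This is an illustrative reconstruction from the summary, not the authors' code; the function names and the tie-breaking behavior are assumptions, and `attn` is assumed to already be average-pooled across heads.

```python
# Hypothetical sketch of the Golden Eviction future-score idea.
# attn[q][k] = head-pooled attention of query q to key k over a full trace.

def future_scores(attn, block_len):
    """Partition queries into blocks and, for each KV pair, take the maximum
    block-averaged attention score among all strictly future blocks."""
    n_q, n_k = len(attn), len(attn[0])
    n_blocks = (n_q + block_len - 1) // block_len
    # Block-wise score: average pooling of attention within each query block.
    block_scores = []
    for b in range(n_blocks):
        rows = attn[b * block_len:(b + 1) * block_len]
        block_scores.append([sum(r[k] for r in rows) / len(rows) for k in range(n_k)])
    # Future score of KV pair k at block b: max over blocks after b.
    fs = []
    for b in range(n_blocks):
        later = block_scores[b + 1:]
        fs.append([max((blk[k] for blk in later), default=0.0) for k in range(n_k)])
    return fs

def golden_evict(scores, keep):
    """Keep the `keep` KV indices with the highest future scores; evict the rest."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(order[:keep]), sorted(order[keep:])
```

With full attention matrices available offline, this oracle marks for removal exactly those KV pairs that no future query block attends to strongly, which is what makes its traces usable as supervision labels.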

In the supervised stage, the scoring model (a two‑layer MLP) receives as input the concatenation of the key, value, and derived attention features. It is trained with a pairwise ranking loss that encourages the predicted scores to preserve the ordering of the future scores generated by Golden Eviction. This equips the model with a notion of which KV pairs are likely to become important later in the generation.
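A pairwise ranking loss of this kind can be sketched as follows. This is a generic RankNet-style formulation chosen for illustration; the paper's exact loss (margins, pair sampling, weighting) may differ.

```python
import math

# Illustrative pairwise ranking loss: whenever the Golden Eviction future
# score says item i should outrank item j, penalize the scorer via
# -log sigmoid(pred_i - pred_j), averaged over all ordered pairs.

def pairwise_ranking_loss(preds, labels):
    loss, n_pairs = 0.0, 0
    for i in range(len(preds)):
        for j in range(len(preds)):
            if labels[i] > labels[j]:  # i should rank above j
                # -log sigmoid(d) == log(1 + exp(-d))
                loss += math.log1p(math.exp(-(preds[i] - preds[j])))
                n_pairs += 1
    return loss / max(n_pairs, 1)
```

Only the relative ordering matters here, which suits eviction: the scorer need not reproduce the oracle's absolute future scores, only rank KV pairs consistently with them.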

To further refine the policy, the authors formulate cache eviction as a Markov Decision Process (MDP). The state consists of the current cache and the scoring model’s outputs; the action is a multinomial sample that selects a subset of KV pairs to evict. The reward is designed to penalize large increases in language‑model loss on low‑entropy tokens (tokens that the model predicts with high confidence, such as numbers, symbols, and entities). These tokens are especially vulnerable to cache deletions, as demonstrated by a substantial loss surge in the authors’ analysis. The reinforcement learning phase employs Group Relative Policy Optimization (GRPO) to maximize the cumulative reward, effectively teaching the scorer to avoid deletions that would cause sharp loss spikes. Importantly, the LLM’s parameters remain frozen throughout; only the scorer is updated.
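The group-relative baseline that gives GRPO its name can be sketched briefly. This assumes the standard GRPO recipe (several rollouts per state, rewards standardized within the group); how ForesightKV instantiates the group and the reward is as described above, but the code below is a generic illustration, not the authors' implementation.

```python
# Illustrative GRPO-style advantage: sample a group of eviction trajectories
# from the same state, then standardize each trajectory's reward against the
# group mean and standard deviation. Trajectories that spike the loss on
# low-entropy tokens receive low rewards and hence negative advantages.

def group_relative_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Because the advantage is computed relative to sibling rollouts rather than a learned value function, no critic network is needed, which keeps the trainable footprint limited to the small scoring MLP.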

Experiments are conducted on two math‑focused benchmarks, AIME2024 and AIME2025, using three reasoning‑capable LLMs (including Qwen‑3‑4B and DeepSeek‑Math‑7B). Under a cache budget of 2 K tokens, ForesightKV retains 92 % of the original model’s performance; under a 4 K budget it retains 99 %. Compared to prior methods such as SnapKV, R‑KV, and Lancucki‑KV, ForesightKV achieves 1.8–2.2× higher throughput while dramatically reducing the loss increase for low‑entropy tokens (average reduction of 45 %). Qualitative analysis shows that the method successfully preserves semantic‑dependent block patterns, preventing the abrupt performance drops that occur when conventional heuristics discard contextually critical KV pairs.

Key insights from the work include: (1) KV importance is highly dynamic, especially for semantic‑dependent patterns that shift across blocks; (2) future attention scores provide a reliable supervision signal for learning long‑term importance; (3) explicitly rewarding the protection of low‑entropy tokens yields a more stable eviction policy. Limitations are acknowledged: constructing Golden Eviction traces requires full attention matrices, which can be costly for very long sequences, and the approach may be less suited to streaming scenarios where future attention is unavailable.

Overall, ForesightKV presents a compelling two‑stage training paradigm that bridges supervised and reinforcement learning to manage KV caches efficiently. By learning to anticipate the future utility of KV pairs, it achieves near‑original performance with only half the memory footprint, marking a significant step toward practical deployment of reasoning‑heavy LLMs in resource‑constrained environments.

