ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation
The prefill stage of long-context Retrieval-Augmented Generation (RAG) is severely bottlenecked by computational overhead. To mitigate this, recent methods assemble pre-computed KV caches of the documents retrieved for a user query and recompute a selected subset of tokens to restore the cross-attention between these caches. However, we identify a fundamental “crowding-out effect” in current token selection criteria: globally salient but user-query-irrelevant tokens saturate the limited recomputation budget, displacing the tokens truly essential for answering the user query and degrading inference accuracy. We propose ProphetKV, a user-query-driven KV cache reuse method for RAG scenarios. ProphetKV dynamically prioritizes tokens based on their semantic relevance to the user query and employs a dual-stage recomputation pipeline that fuses layer-wise attention metrics into a high-utility token set. By ensuring the recomputation budget is dedicated to bridging the informational gap between retrieved context and the user query, ProphetKV achieves high-fidelity attention recovery with minimal overhead. Our extensive evaluation results show that ProphetKV retains 96%-101% of full-prefill accuracy with only a 20% recomputation ratio, while achieving accuracy improvements of 8.8%-24.9% on RULER and 18.6%-50.9% on LongBench over state-of-the-art approaches (e.g., CacheBlend, EPIC, and KVShare).
💡 Research Summary
The paper tackles the severe latency bottleneck that occurs during the pre‑fill stage of long‑context Retrieval‑Augmented Generation (RAG). In RAG, a large language model (LLM) must ingest dozens to thousands of retrieved document chunks, inflating the input length to tens of thousands or even a million tokens. Because self‑attention scales quadratically with sequence length, the pre‑fill step dominates the overall time‑to‑first‑token (TTFT), making real‑time services impractical.
A straightforward way to reduce this cost is to reuse the key‑value (KV) cache that the model builds while processing the prompt. Traditional KV‑cache reuse relies on exact prefix matching, which rarely holds when document order changes. Position‑independent (PI) reuse relaxes this constraint by allowing pre‑computed chunk‑wise caches to be concatenated regardless of order, but naïvely concatenating them discards cross‑attention between chunks, causing dramatic drops in answer quality.
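The failure mode of naïve position-independent reuse can be made concrete with a small sketch. The function below (an illustrative construction, not code from the paper) concatenates per-chunk KV caches along the sequence axis; because each chunk was encoded in isolation, the resulting cache contains no keys or values that reflect attention across chunk boundaries.

```python
import numpy as np

def concat_chunk_caches(chunk_caches):
    """Naive position-independent (PI) reuse: stitch pre-computed
    per-chunk KV caches together along the sequence axis.

    chunk_caches: list of (K, V) pairs, each of shape (seq_len, d).
    Note: cross-attention between chunks is lost, since every chunk
    was encoded without seeing the others.
    """
    ks, vs = zip(*chunk_caches)
    return np.concatenate(ks, axis=0), np.concatenate(vs, axis=0)
```

Partial-recomputation methods exist precisely to repair the cross-chunk interactions this concatenation discards.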
Recent work (EPIC, CacheBlend, KVShare) introduced partial recomputation: only a small subset of tokens is re‑processed to reconstruct the missing cross‑attention. These methods select tokens based on global attention weights, KV‑cache deviation, or hidden‑state deviation. While they recover some attention, the authors identify a fundamental “crowding‑out effect”: globally salient tokens that are irrelevant to the user query consume the limited recomputation budget, pushing out the truly query‑critical tokens. Empirically, this leads to up to an 86 % accuracy loss on representative benchmarks.
ProphetKV proposes a paradigm shift: instead of trying to approximate the full attention map, it focuses on the query‑driven portion of attention that actually matters for answer generation. The key insights are: (1) In RAG the user query is placed at the end of the prompt and its own attention distribution reliably predicts which context tokens will be attended during decoding; (2) Cross‑attention utility is therefore query‑contingent. By treating the query as a “prophet”, the system can extract a relevance signal directly from the query’s attention scores.
The method consists of two stages. Stage I – Query‑Guided Token Scoring: the model computes the attention weights from the query tokens to all context tokens in a single forward pass, and the context tokens receiving the highest query‑to‑context attention are assigned high importance scores. Stage II – Dual‑Stage Re‑computation with Layer Fusion: for each transformer layer, additional metrics (e.g., Q‑K similarity, head‑wise importance) are gathered, and a fusion algorithm aggregates these layer‑wise signals into a unified utility score, ensuring that tokens that become important only in deeper layers are not missed. The final set of tokens to recompute is the top p% of tokens ranked by this fused score, where p is a small budget (e.g., 20%).
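The two stages above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, its signature, the mean-pooling of query-token attention, and the max-over-layers fusion rule are all assumptions for exposition, not the paper's exact algorithm.

```python
import numpy as np

def select_recompute_tokens(layer_attn, budget_ratio=0.2):
    """Hypothetical sketch of query-driven token selection.

    layer_attn: list of arrays, one per transformer layer, each of
    shape (num_query_tokens, num_context_tokens), holding attention
    weights from the user-query tokens to the context tokens.
    Returns indices of the context tokens to recompute.
    """
    # Stage I: per-layer importance of each context token — average
    # the attention mass it receives from the query tokens.
    per_layer_scores = [attn.mean(axis=0) for attn in layer_attn]

    # Stage II: fuse layer-wise signals so tokens that only become
    # important in deeper layers are not missed (max over layers is
    # an assumed fusion rule; the paper's may differ).
    fused = np.max(np.stack(per_layer_scores), axis=0)

    # Keep the top p% of context tokens under the recomputation budget.
    k = max(1, int(budget_ratio * fused.shape[0]))
    return np.argsort(fused)[-k:][::-1]
```

With a 20% budget over 50 context tokens, the function returns the 10 highest-utility token indices, which are then the only positions re-run through the model.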
Because only the selected tokens are recomputed, the cross‑attention between the query and the most relevant context is restored, while the majority of the KV cache remains untouched. This yields a dramatic reduction in FLOPs: the cost scales linearly with the number of recomputed tokens rather than quadratically with the full sequence length.
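The cost argument can be checked with a simplified attention-only FLOP model (ignoring MLP and normalization layers; the function names and constant factors are illustrative): full prefill pays for every token attending to every token, whereas partial recomputation pays only for the r selected tokens attending over the cached sequence.

```python
def prefill_flops(n, d):
    # Full prefill self-attention: n tokens each attend over n positions.
    return n * n * d

def prophetkv_flops(n, r, d):
    # Partial recomputation: only r selected tokens attend over the
    # n cached positions, so cost is linear in r for fixed n.
    return r * n * d
```

For a 10,000-token context with a 20% budget, the attention cost of recomputation is 20% of full prefill, and it shrinks proportionally as the budget shrinks.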
Experiments were conducted on multiple LLMs (Llama‑3‑8B‑Inst, Qwen2.5‑14B‑Inst, Qwen3‑14B‑Thor) and evaluated on the RULER and LongBench suites. With a recomputation ratio of just 20 %, ProphetKV achieves 96 %–101 % of the full‑pre‑fill accuracy, outperforming EPIC, CacheBlend, and KVShare by 8.8 %–24.9 % on RULER and 18.6 %–50.9 % on LongBench. The overlap ratio between selected tokens and the ground‑truth query‑attended tokens exceeds 0.85 across models, confirming that the query‑driven selection is highly faithful.
Importantly, ProphetKV is training‑free; it does not require any fine‑tuning of auxiliary models, making it a plug‑and‑play addition to existing KV‑cache pipelines. The approach incurs negligible runtime overhead beyond the modest extra forward pass needed to compute query attention scores.
In summary, ProphetKV replaces the unrealistic goal of reconstructing the entire missing attention map with a focused, query‑centric recomputation strategy. By leveraging the query’s own attention as a prophetic signal and fusing layer‑wise importance, it allocates the limited recomputation budget to the tokens that truly matter for answer generation. This yields near‑full accuracy with a fraction of the computational cost, offering a practical solution for deploying long‑context RAG systems at scale.