More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression
While Large Language Models (LLMs) can theoretically support extensive context windows, their practical deployment is constrained by the linear growth of Key-Value (KV) cache memory. Prevailing compression strategies mitigate this through various pruning mechanisms, yet trade semantic recall for memory efficiency. In this work, we present LASER-KV (Layer Accumulated Selection with Exact-LSH Recall), a framework designed to test the limits of KV compression under a strict accumulative budgeting policy. We deviate from the standard fixed-summary-size approach by implementing a block-wise accumulation strategy governed by a protection divisor (n), which allows us to isolate the effects of compression from sliding-window artifacts. Our experiments on the Babilong benchmark reveal that previous compression methods degrade by 15-30% on various long-context tasks, whereas LASER-KV maintains stable performance, achieving superior accuracy by a margin of up to 10% at 128k context length. These findings challenge the prevailing assumption that attention scores alone are a sufficient proxy for token utility.
💡 Research Summary
The paper addresses a fundamental bottleneck in deploying large language models (LLMs) with long context windows: the linear growth of the key‑value (KV) cache as the sequence length increases. Existing KV‑cache compression techniques—such as sliding‑window eviction, attention‑score‑based pruning, or external memory offloading—typically rely on a greedy selection criterion that uses the current query’s attention scores alone. While this works for short‑range relevance, it often discards tokens that are structurally important for future queries, leading to severe performance degradation on long‑context reasoning tasks.
To overcome this limitation, the authors introduce LASER‑KV (Layer Accumulated Selection with Exact‑LSH Recall), a two‑pronged framework that (1) decouples compression from sliding‑window artifacts through an accumulative, block‑wise budgeting policy governed by a single hyper‑parameter called the protection divisor n, and (2) replaces pure attention‑score selection with a hybrid Exact‑LSH policy that combines exact attention scores (Exact) with Locality Sensitive Hashing (LSH) based scoring (MagicPIG).
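The hybrid selection idea can be illustrated with a toy sketch. Everything here is an assumption for illustration, not the paper's implementation: the SimHash-style random-hyperplane hashing stands in for MagicPIG's actual LSH scheme, the even split of the budget between exact and LSH candidates is hypothetical, and the function names (`hybrid_select`, `simhash_signature`) are invented.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def simhash_signature(vec, hyperplanes):
    # Sign of the projection onto each random hyperplane (SimHash).
    return tuple(1 if dot(vec, h) >= 0 else 0 for h in hyperplanes)

def hybrid_select(query, keys, budget, num_planes=16, seed=0):
    """Toy hybrid Exact+LSH token selection (illustrative sketch only).

    Half the budget goes to tokens with the highest exact dot-product
    scores; the other half to tokens whose SimHash signature collides
    most often with the query's signature. Returns selected indices.
    """
    rng = random.Random(seed)
    dim = len(query)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

    # Stage 1: exact relevance scores (stand-in for summed attention scores).
    exact = sorted(range(len(keys)), key=lambda i: -dot(query, keys[i]))

    # Stage 2: LSH collision counts against the query's signature.
    q_sig = simhash_signature(query, planes)
    def collisions(i):
        k_sig = simhash_signature(keys[i], planes)
        return sum(a == b for a, b in zip(q_sig, k_sig))
    lsh = sorted(range(len(keys)), key=lambda i: -collisions(i))

    half = budget // 2
    return sorted(set(exact[:half]) | set(lsh[:budget - half]))
```

A key whose vector aligns with the query scores highly under both criteria, while the LSH pool can rescue tokens whose exact score for the *current* query is low but whose direction makes them likely matches for future queries.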
Accumulative Budgeting and Protection Divisor.
The input sequence is split into fixed‑size blocks (S_block = 4096 tokens). For each block a total budget B is allocated. The protection divisor n partitions B into three logical pools: (i) a “syntactic set” of size 2B/n, further divided equally into (a) global anchors (B/n tokens) that preserve the earliest tokens and act as attention sinks, and (b) a local sliding window (B/n tokens) that maintains grammatical coherence; (ii) a “recall budget” of size B − 2B/n, which is dedicated to long‑term memory. By lowering n, the local window becomes larger, stabilizing generation; raising n tightens the window and forces more aggressive pruning. This explicit control over the recent‑vs‑historical token ratio is a novel contribution that directly addresses generation stability.
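The arithmetic of the partition above can be sketched as follows; this is a minimal reconstruction from the stated pool sizes, with integer-division rounding and the function name assumed rather than taken from the paper.

```python
def partition_budget(B, n):
    """Split a per-block KV budget B using protection divisor n.

    Sketch of the budgeting described in the text (rounding behavior
    is an assumption). Returns (anchors, window, recall):
      - global anchors: B/n earliest tokens, kept as attention sinks
      - local window:   B/n most recent tokens, for grammatical coherence
      - recall budget:  B - 2B/n tokens, reserved for long-term memory
    """
    if n < 2:
        raise ValueError("protection divisor n must be >= 2")
    anchors = B // n
    window = B // n
    recall = B - anchors - window
    return anchors, window, recall
```

For example, with B = 4096 and n = 4 each syntactic pool gets 1024 tokens and the recall budget is 2048; raising n to 8 shrinks each pool to 512 and grows the recall budget to 3072, matching the trade-off the text describes.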
Exact‑LSH Selection Policy.
The policy operates in two stages. First, attention scores are summed across all layers and heads for each candidate token, producing a global relevance score S_exact. The top α·B_long tokens (α ∈