Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression
Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero or full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a more granular level, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves comparable performance to full attention on LongBench with only 6.25% of the KV cache. Furthermore, in the Needle-in-a-Haystack test, our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
💡 Research Summary
The paper tackles the growing memory burden of key‑value (KV) caches in transformer‑based large language models, especially when processing very long contexts. Traditional KV‑cache compression methods fall into two categories: token eviction, which discards entire tokens (effectively assigning them zero dimensions), and uniform dimensionality reduction, which compresses every token to the same lower dimension. Both approaches are coarse‑grained and waste capacity that could otherwise be used to preserve information.
To overcome this limitation, the authors propose MixedDimKV, a mixed‑dimension KV‑cache compression framework that allocates a distinct number of dimensions to each token based on its estimated importance. A variant, MixedDimKV‑H, further incorporates head‑level importance scores, enabling even finer‑grained budgeting across attention heads.
The core pipeline consists of four stages:
- Candidate compression ratios – a predefined set (e.g., 0 %, 25 %, 50 %, 75 %, 100 %) that specifies possible fractional budgets for keys and values.
- Loss‑score computation – for every token‑ratio pair the authors define an “accuracy loss score” that upper‑bounds the deviation in the attention output caused by compression. The score combines two terms: (a) the change in attention scores multiplied by the original value vectors, and (b) the original attention scores multiplied by the change in value vectors. This captures both attention importance and value magnitude, unlike earlier methods that rely solely on attention weights. The loss is efficiently computed in batch using a simple algorithm that first obtains the original attention matrix P, then recomputes it after compressing K and V with PCA, and finally forms an error matrix E = |P′‑P|·‖V‖² + P·‖V‑V′‖². Summing E over all queries yields a per‑token loss vector L.
- Discrete budget allocation – the problem is formalized as minimizing the total loss subject to a global dimension budget B (e.g., 6.25 % of the original KV size). Although the decision variables are discrete, the authors show that the primal‑dual gap is bounded by a small constant independent of sequence length, allowing them to solve the Lagrangian dual instead. Introducing a multiplier λ ≥ 0, each token's sub‑problem becomes min_{d∈D} L_i(d) + λ·d. Because L_i(d) is monotonically decreasing in d, the total dimension consumption C(λ) = Σ_i d_i*(λ) is a monotone step function of λ. Consequently, an efficient bisection search over λ finds the λ* that makes C(λ*) ≈ B, and the corresponding per‑token dimensions d_i* are the final allocation.
- Head‑wise PCA compression – rather than concatenating all heads and performing a joint low‑rank decomposition, the authors compress each head independently. This reduces the storage overhead of projection matrices from O((H_kv·D)²·ρ) to O(H_kv·D²·ρ) and preserves the ability to allocate different budgets per head, which is crucial when head importance varies. The head‑wise approach also avoids the need to reconstruct the full KV cache for attention computation, keeping inference latency low.
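The batched loss computation in the second stage can be sketched in a few lines of NumPy. This is a minimal illustration of the error matrix E = |P′−P|·‖V‖² + P·‖V−V′‖² from the summary above, assuming P and P′ are the attention matrices before and after compressing K, and V′ is the value matrix reconstructed after compression; the function name and shapes are our own, not the paper's.

```python
import numpy as np

def per_token_loss(P, P_prime, V, V_prime):
    """Per-token accuracy-loss scores for one candidate compression ratio.

    P, P_prime : (n_query, n_key) attention matrices before/after compressing K
    V, V_prime : (n_key, d) value vectors before/after compression
    Returns a length-n_key loss vector L (E summed over queries)."""
    v_norm = np.linalg.norm(V, axis=1) ** 2             # ||V_j||^2 per key
    dv_norm = np.linalg.norm(V - V_prime, axis=1) ** 2  # ||V_j - V'_j||^2
    # E = |P' - P| * ||V||^2 + P * ||V - V'||^2, broadcast over queries
    E = np.abs(P_prime - P) * v_norm[None, :] + P * dv_norm[None, :]
    return E.sum(axis=0)  # sum over queries -> per-key loss vector L
```

Running this once per candidate ratio yields the token-by-ratio loss table that the budget-allocation stage consumes; with no compression (P′ = P, V′ = V) the loss is exactly zero, as expected.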
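The bisection over the Lagrange multiplier in the third stage can be sketched as follows, assuming a precomputed table of per-token losses for each candidate dimension count; the helper names, candidate set, and tolerance are illustrative rather than the paper's exact implementation.

```python
import numpy as np

def allocate_dimensions(loss, dims, budget, tol=1e-6, max_iter=100):
    """Bisection on the Lagrangian dual of the discrete budget problem.

    loss   : (n_tokens, n_candidates) per-token loss L_i(d), decreasing in d
    dims   : (n_candidates,) candidate dimension counts, ascending
    budget : global dimension budget B
    Returns per-token candidate indices whose total dimension use is <= B."""
    def consumption(lam):
        # Per-token sub-problem: argmin_d L_i(d) + lam * d
        scores = loss + lam * dims[None, :]
        choice = scores.argmin(axis=1)
        return dims[choice].sum(), choice

    lo, hi = 0.0, 1.0
    # Grow hi until the allocation fits the budget (lo stays infeasible)
    while consumption(hi)[0] > budget:
        hi *= 2
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        used, _ = consumption(mid)
        if used > budget:
            lo = mid        # multiplier too small: over budget
        else:
            hi = mid        # feasible: tighten from above
        if hi - lo < tol:
            break
    return consumption(hi)[1]  # hi-side allocation always satisfies the budget
```

Because C(λ) is a monotone step function, the loop converges to the critical λ* in O(log(1/tol)) evaluations, each a single vectorized argmin over the loss table.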
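Stage four's head-wise PCA can be illustrated with a small sketch: because each head is compressed independently, each stored projection is only a D×r matrix, which is where the O(H_kv·D²·ρ) storage figure comes from (versus (H_kv·D)²·ρ for a joint decomposition). The function and shapes below are assumptions for illustration, not the paper's code.

```python
import numpy as np

def headwise_pca(KV, ranks):
    """Compress each head's keys (or values) independently with PCA.

    KV    : (n_heads, n_tokens, d) cache slice for one layer
    ranks : per-head retained dimensions; may differ across heads
    Returns one (projection, mean, compressed) triple per head."""
    out = []
    for h, r in enumerate(ranks):
        X = KV[h]                               # (n_tokens, d)
        mu = X.mean(axis=0, keepdims=True)
        # PCA via SVD of the centered matrix; top-r right singular vectors
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        W = Vt[:r].T                            # (d, r) projection per head
        out.append((W, mu, (X - mu) @ W))       # compressed: (n_tokens, r)
    return out
```

Keeping projections per head also means each head's compressed cache can be used directly in attention (after projecting the query or reconstructing only that head), which is consistent with the low-latency claim above.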
MixedDimKV‑H extends this pipeline by feeding head‑level importance scores (obtained from a separate profiler) into the loss‑score calculation, thereby biasing the allocation toward heads that contribute more to the final output.
Experimental evaluation is performed on several long‑context benchmarks: LongBench, RULER, and a Needle‑in‑a‑Haystack retrieval test. Key findings include:
- On LongBench, MixedDimKV achieves performance comparable to full‑attention models while using only 6.25 % of the KV cache (KV size = 512).
- Without any head‑level profiling, MixedDimKV outperforms prior token‑eviction and uniform‑dimensionality methods.
- When equipped with the same head‑importance information, MixedDimKV‑H consistently surpasses the state‑of‑the‑art HeadKV method.
- In the Needle‑in‑a‑Haystack scenario with a 50 K token context, the method maintains 100 % accuracy while using merely 0.26 % of the cache (KV size = 128).
- Decoding latency per token drops to roughly 55 % of the baseline full‑attention runtime, demonstrating practical speed‑up.
The authors also provide a theoretical justification (Theorem 4.1) that the duality gap is negligible, ensuring that the bisection‑based solution is near‑optimal for long sequences.
Implications and limitations: MixedDimKV introduces a new degree of freedom—per‑token dimension budgeting—that bridges the gap between coarse token eviction and uniform low‑rank compression. By coupling a principled loss metric with an efficient discrete optimization, it delivers strong memory efficiency without sacrificing accuracy, making it attractive for deploying LLMs on devices with limited GPU memory or for serving many concurrent long‑context requests. However, the current implementation relies on linear PCA projections; exploring non‑linear or quantization‑based compression schemes could further improve the trade‑off. Additionally, the set of candidate ratios and the loss‑score hyperparameters may need tuning for different model architectures or downstream tasks.
In summary, the paper presents a well‑motivated, theoretically grounded, and empirically validated framework that advances KV‑cache compression beyond binary token eviction, offering a practical path toward scalable, long‑context inference in large language models.