KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs
Large language models rely on KV-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of KV-caches and their variation across layers. We introduce KV-CoRE (KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of KV-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of KV-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.
💡 Research Summary
Large language models (LLMs) rely on key‑value (KV) caches to avoid recomputing attention outputs during autoregressive decoding. As the context window grows, the repeated reads and writes to the KV‑cache become a serious memory‑bandwidth bottleneck on GPUs. Existing KV‑cache compression methods mostly target the projection matrices themselves and treat all layers and inputs uniformly, ignoring the fact that the actual rank of the cached keys and values is highly data‑dependent and varies across layers.
KV‑CoRE (KV‑cache Compressibility by Rank Evaluation) addresses these gaps with a three‑part framework. First, it computes an incremental singular value decomposition (SVD) of the key and value activations for each layer over an entire dataset. By accumulating a covariance matrix C = Σ_t k_tᵀ k_t token‑by‑token, the method avoids storing the full KV matrix (which would be O(l · d) in size, where l is the number of cached tokens) and instead requires only O(d²) memory, where d is the hidden dimension per head. After processing all tokens, an eigen‑decomposition C = V Σ² Vᵀ yields the exact singular values Σ (as square roots of the eigenvalues) and right singular vectors V of the KV matrix, guaranteeing mathematically equivalent results to a full SVD.
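The accumulation step above can be sketched in a few lines of NumPy. This is a minimal illustration of the general technique (streaming a Gram matrix and eigen-decomposing it), not the paper's implementation; the function name and dimensions are illustrative:

```python
import numpy as np

def incremental_singular_values(key_stream, d):
    """Accumulate the d x d Gram matrix C = sum_t k_t^T k_t token by token,
    then recover the singular values and right singular vectors of the full
    key matrix K without ever materializing K (O(d^2) memory)."""
    C = np.zeros((d, d))
    for k_t in key_stream:                     # k_t: (d,) key vector for one token
        C += np.outer(k_t, k_t)
    # Eigen-decomposition of C = V diag(sigma^2) V^T
    eigvals, V = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]          # sort descending
    sigma = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return sigma, V[:, order]

# Sanity check: the streamed result matches a full SVD on a small key matrix
rng = np.random.default_rng(0)
K = rng.normal(size=(1000, 64))                # l tokens x d dims per head
sigma_inc, V_inc = incremental_singular_values(K, 64)
sigma_full = np.linalg.svd(K, compute_uv=False)
assert np.allclose(sigma_inc, sigma_full, atol=1e-6)
```

The check at the end confirms the equivalence claimed in the text: the eigenvalues of C = KᵀK are exactly the squared singular values of K.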
Second, KV‑CoRE leverages the Eckart‑Young‑Mirsky theorem to obtain the optimal low‑rank approximation under the Frobenius norm. For a chosen rank k, the optimal compressed projection is W̃_K = W_K V_k V_kᵀ, where V_k contains the top‑k right singular vectors. In practice this translates to replacing the original projection W_K with a down‑projection W_K V_k and an up‑projection V_kᵀ, allowing the cache to store only k‑dimensional key vectors while preserving the best possible reconstruction error.
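The factorization above can be verified numerically. The sketch below (a hedged illustration with made-up dimensions, not the paper's code) builds the down-/up-projection from the top-k right singular vectors and checks that the resulting reconstruction error equals the Eckart-Young optimum, √(Σ_{i>k} σ_i²):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, k = 128, 64, 16               # illustrative sizes

W_K = rng.normal(size=(d_model, d_head))       # original key projection
X = rng.normal(size=(500, d_model))            # token activations
K_full = X @ W_K                               # uncompressed keys, (500, d_head)

# Top-k right singular vectors of the key matrix
_, S, Vt = np.linalg.svd(K_full, full_matrices=False)
V_k = Vt[:k].T                                 # (d_head, k)

# Factorized projection: down-project at write time, up-project at read time
W_down = W_K @ V_k                             # (d_model, k)
K_cached = X @ W_down                          # cache stores only k dims per token
K_approx = K_cached @ V_k.T                    # reconstruct on read

# Eckart-Young-Mirsky: Frobenius error matches the best rank-k approximation
best_err = np.sqrt(np.sum(S[k:] ** 2))         # optimal error from tail singular values
err = np.linalg.norm(K_full - K_approx, "fro")
assert np.isclose(err, best_err, rtol=1e-8)
```

Because K_full V_k V_kᵀ equals the rank-k SVD truncation U_k Σ_k V_kᵀ, the cached k-dimensional keys lose exactly the energy in the discarded singular values and no more.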
Third, the paper introduces Normalized Effective Rank (NER) as a lightweight, data‑driven compressibility metric. The effective rank erank(K) = exp(−∑_i p_i log p_i) treats the normalized singular values p_i = σ_i/∑_j σ_j as a probability distribution and measures its entropy. NER = erank(K)/r, where r is the actual rank of K, normalizes this value to the interval (0, 1], with lower values indicating greater low‑rank compressibility.
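The definitions above translate directly into code. The following is a minimal sketch of the metric as defined (the rank-detection threshold is an assumption of this sketch, not from the paper):

```python
import numpy as np

def normalized_effective_rank(K):
    """NER = erank(K) / rank(K), where erank is the entropy-based effective
    rank of the normalized singular-value distribution p_i = s_i / sum_j s_j."""
    sigma = np.linalg.svd(K, compute_uv=False)
    # Numerical rank: singular values above a relative tolerance (sketch's choice)
    r = int(np.sum(sigma > sigma[0] * max(K.shape) * np.finfo(K.dtype).eps))
    p = sigma[:r] / sigma[:r].sum()
    erank = np.exp(-np.sum(p * np.log(p)))     # entropy-based effective rank
    return erank / r

# A nearly rank-1 matrix is highly compressible (NER near 1/r), while a
# matrix with a flat spectrum is not (NER = 1 for the identity).
rng = np.random.default_rng(2)
u, v = rng.normal(size=(100, 1)), rng.normal(size=(1, 64))
low_rank = u @ v + 1e-6 * rng.normal(size=(100, 64))
assert normalized_effective_rank(low_rank) < 0.1
assert np.isclose(normalized_effective_rank(np.eye(64)), 1.0)
```

The two checks illustrate the extremes: a flat singular spectrum yields NER = 1 (incompressible), while a spectrum dominated by one singular value drives NER toward 1/r.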