ManifoldKV: Training-Free KV Cache Compression via Euclidean Outlier Detection
Long-context inference is constrained by KV-cache memory, which grows linearly with sequence length; KV-cache compression therefore hinges on reliably selecting which past tokens to retain. Most geometry-based eviction methods score keys by cosine similarity to a global centroid, but cosine is scale-invariant and can discard magnitude cues that distinguish semantically salient tokens. We propose ManifoldKV, a training-free scorer that ranks tokens by Euclidean distance to the key centroid, capturing both angular and radial deviations. On the RULER benchmark, ManifoldKV achieves 95.7% accuracy at 4K-16K contexts with 20% compression, matching the best geometric baseline while improving robustness in two regimes where cosine scoring fails. First, on multi-key retrieval, ManifoldKV reduces directional collisions, achieving 92.4% vs KeyDiff’s 77.0% (+15.4 points) on 3-key NIAH at 50% compression. Second, to address the dilution and performance collapse of global centroids at 64K context, we introduce WindowedManifoldKV, which restores accuracy to 84.3% at 25% compression, a 49-point recovery over global L2 and +3.2 points over KeyDiff. The method requires only 3 lines of code and works across 4 architectures without tuning.
💡 Research Summary
The paper tackles the memory bottleneck of KV‑cache in transformer‑based large language models (LLMs) during long‑context inference. Because the KV‑cache grows linearly with the number of processed tokens, storing keys and values for 100 K tokens can require over 60 GB of memory, making deployment impractical. Existing compression strategies fall into two families: attention‑based eviction (e.g., SnapKV, H2O) that keeps tokens with high cumulative attention, and geometry‑based eviction (e.g., KeyDiff) that scores tokens by cosine similarity to the global key centroid. The latter discards magnitude information, so tokens that are radially far from the centroid but share its direction receive the same score as typical tokens, leading to loss of important entities and numbers.
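The scale-invariance criticized here is easy to see numerically. Below is a minimal NumPy sketch, not KeyDiff's exact formulation (the function name `cosine_outlier_score` is illustrative): a key aligned with the centroid direction receives essentially the same cosine-based score whether its magnitude is typical or ten times larger, so a magnitude outlier would be evicted first.

```python
import numpy as np

def cosine_outlier_score(keys: np.ndarray) -> np.ndarray:
    """Illustrative cosine-based eviction score: 1 - cos(k_i, centroid).

    Low score = token looks 'typical' and is a candidate for eviction.
    """
    mu = keys.mean(axis=0)
    cos = keys @ mu / (np.linalg.norm(keys, axis=1) * np.linalg.norm(mu) + 1e-12)
    return 1.0 - cos

# keys[2] has 10x the magnitude of keys[0] but the same direction:
keys = np.array([[1.0, 0.0],
                 [1.0, 0.1],
                 [10.0, 0.0]])
s = cosine_outlier_score(keys)
# s[0] and s[2] are identical: the cosine score cannot tell the
# large-magnitude key apart from an ordinary one.
```

This is exactly the failure mode the summary describes: a radially extreme key that shares the centroid's direction scores as unremarkable under cosine.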
ManifoldKV proposes a simple yet effective alternative: rank tokens by the squared Euclidean distance between each key vector $k_i$ and the global centroid $\mu = \frac{1}{N}\sum_i k_i$. The score $s_i = \|k_i - \mu\|_2^2$ expands to $r_i^2 + \|\mu\|^2 - 2 r_i \|\mu\| \cos\theta_i$, where $r_i = \|k_i\|$ and $\theta_i$ is the angle between $k_i$ and $\mu$. Thus the score simultaneously captures (a) radial deviation (magnitude), (b) angular deviation, and (c) a constant term $\|\mu\|^2$. Tokens that are outliers in either dimension receive high scores and are retained. The algorithm runs in $O(Nd + N\log N)$ time, negligible compared to the $O(N^2 d)$ cost of attention, adding less than 0.5 ms latency at 64 K context length.
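The scoring rule above fits in a few lines of NumPy. This is a minimal sketch of the described algorithm (centroid, squared L2 distances, Top-K), not the authors' reference implementation; function names are illustrative.

```python
import numpy as np

def manifoldkv_scores(keys: np.ndarray) -> np.ndarray:
    """Score each key by squared Euclidean distance to the global centroid.

    keys: (N, d) array of key vectors for one attention head.
    Returns an (N,) array; higher scores mark outlier tokens to retain.
    """
    mu = keys.mean(axis=0)                      # global centroid, shape (d,)
    return ((keys - mu) ** 2).sum(axis=-1)      # s_i = ||k_i - mu||_2^2

def select_tokens(keys: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return indices of the keep_ratio fraction of tokens with highest scores."""
    scores = manifoldkv_scores(keys)            # O(Nd)
    k = max(1, int(round(keep_ratio * len(keys))))
    kept = np.argsort(scores)[-k:]              # O(N log N) Top-K via sort
    return np.sort(kept)                        # restore original token order
```

A usage example: with keys `[[0,0], [0,0], [10,0]]` and `keep_ratio=1/3`, the magnitude outlier at index 2 is the single token retained, since it lies farthest from the centroid in both the radial and angular sense.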
Empirical evaluation uses the RULER benchmark, a synthetic suite covering context lengths from 4 K to 128 K tokens and tasks such as Needle‑in‑a‑Haystack (NIAH) retrieval and word‑extraction. With 20 % compression (i.e., retaining 80 % of tokens) on 4 K–16 K contexts, ManifoldKV achieves 95.7 % accuracy, outperforming the best prior geometric method (KeyDiff at 81.1 %). In the challenging 3‑key NIAH scenario at 50 % compression, ManifoldKV reaches 92.4 % versus KeyDiff’s 77.0 %, a 15.4‑point gain, demonstrating that magnitude information prevents directional collisions when multiple important tokens share similar directions.
However, at very long contexts (≥ 64 K tokens) the global centroid becomes a “center of mass” over many semantic clusters, a phenomenon the authors call the Centroid Dilution Problem. The centroid loses semantic meaning, and all tokens become approximately equidistant from it, causing the L2 scores to lose discriminative power. Accuracy collapses from 82.3 % at 32 K to 35.2 % at 64 K under the same compression ratio.
To remedy this, the authors introduce WindowedManifoldKV. The method partitions the key sequence into sliding windows (e.g., 4 K tokens), computes a local centroid for each window, and scores tokens within the window using the same L2 distance. Because each window spans a coherent semantic segment, the local centroid remains meaningful, preserving the ability to separate outliers. On 64 K contexts with 25 % compression, WindowedManifoldKV restores accuracy to 84.3 %, a 49‑point improvement over global L2 and a modest 3.2‑point edge over KeyDiff.
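The windowed variant can be sketched as follows. This minimal version assumes non-overlapping fixed-size windows for simplicity; the paper describes sliding windows, so the exact windowing scheme may differ.

```python
import numpy as np

def windowed_manifoldkv_scores(keys: np.ndarray, window: int = 4096) -> np.ndarray:
    """Score keys against per-window local centroids instead of a global one.

    Each contiguous block of `window` tokens gets its own centroid; a token's
    score is its squared L2 distance to that local centroid. Because a window
    spans a coherent semantic segment, the local centroid stays meaningful
    even at ultra-long context lengths.
    """
    n = len(keys)
    scores = np.empty(n)
    for start in range(0, n, window):
        chunk = keys[start:start + window]
        mu = chunk.mean(axis=0)                             # local centroid
        scores[start:start + window] = ((chunk - mu) ** 2).sum(axis=-1)
    return scores
```

Top-K selection over these scores then proceeds exactly as in the global variant; only the centroid computation changes.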
A further contribution is the observation that key vectors across diverse models lie on a low‑dimensional (~9‑D) manifold, as estimated by the Two‑NN intrinsic‑dimension method. This explains why the same Euclidean scoring works zero‑shot across four architectures (Llama‑3.1‑8B, Llama‑2‑70B, Mistral‑7B, Gemma‑2B) with negligible variance (±0.3 %). The method requires only three lines of code (centroid computation, distance calculation, Top‑K selection) and integrates seamlessly with existing KV‑cache pipelines, incurring minimal computational overhead.
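The Two-NN estimator mentioned above can be sketched with a generic maximum-likelihood variant: for each point, take the ratio of its second- to first-nearest-neighbor distance; under the Two-NN model this ratio is Pareto-distributed with exponent equal to the intrinsic dimension, giving the estimate $\hat d = N / \sum_i \ln(r_{2,i}/r_{1,i})$. This is a textbook-style sketch, not the authors' exact implementation, and uses a brute-force $O(N^2 d)$ distance matrix suitable only for small samples.

```python
import numpy as np

def two_nn_intrinsic_dim(points: np.ndarray) -> float:
    """Two-NN intrinsic-dimension estimate (maximum-likelihood sketch).

    points: (N, D) array of samples from the manifold.
    Returns the estimated intrinsic dimension d_hat = N / sum(log r2/r1).
    """
    # Brute-force pairwise squared distances; exclude self-distances.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nearest_two = np.partition(d2, 1, axis=1)[:, :2]   # two smallest per row
    r1 = np.sqrt(nearest_two.min(axis=1))              # 1st-NN distance
    r2 = np.sqrt(nearest_two.max(axis=1))              # 2nd-NN distance
    mu = r2 / np.maximum(r1, 1e-12)                    # Pareto(d) ratios
    return len(points) / np.log(mu).sum()
```

Applied to points sampled uniformly from a 2-D plane embedded in a higher-dimensional ambient space, the estimate comes out near 2, illustrating how an ambient-dimension-independent intrinsic dimension is recovered.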
In summary, the paper identifies a critical flaw in cosine‑based KV‑cache eviction—its ignorance of magnitude—and replaces it with an L2‑based outlier detector that captures both angular and radial deviations. The basic ManifoldKV excels on short‑to‑medium contexts, while the Windowed extension overcomes centroid dilution on ultra‑long contexts. The approach is model‑agnostic, extremely lightweight, and delivers state‑of‑the‑art compression performance without any training or hyper‑parameter tuning.