KVSculpt: KV Cache Compression as Distillation


Authors: Bo Jiang, Sian Jin

K V S C U L P T : KV Cache Compr ession as Distillation Bo Jiang T emple Uni versity bo.jiang@temple.edu Sian Jin T emple Uni versity sian.jin@temple.edu Abstract KV cache compression is critical for ef ficient long-context LLM inference. Approaches that reduce the per -pair footprint—quantization and low-rank decomposition—are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction— selecting which KV pairs to keep—to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache en- tries. W e propose K V S C U L P T , which moves to the other end of this spectrum: instead of selecting or combining original pairs, we opti- mize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer’ s attention behavior . Keys are optimized via L-BFGS and v alues are solved in closed form via least squares, alternating e very fe w steps. On top of this, we introduce adaptive budg et allocation , which uses a cheap pilot compression run to redistribute the compres- sion budget across layers and KV heads based on per-component dif ficulty . On Qwen2.5-1.5B-Instruct with 2048-token contexts, K V S C U L P T reduces KL div er- gence by 3 . 5 – 4 . 1 × compared to Select+Fit— attention-score eviction with least-squares value fitting—across compression ratios r ∈ { 0 . 3 , 0 . 5 , 0 . 7 } . Adaptiv e allocation provides an additional 1 . 3 × KL reduction at no extra inference cost. Analysis rev eals that compres- sion difficulty is highly non-uniform: per-layer pilot MSE v aries by up to 100 × across layers, and the two KV heads within a single layer can dif fer by up to 467 × —demonstrating that fine-grained budget allocation is essential. 1 Introduction Autoregressi ve large language models (LLMs) cache ke y-v alue (KV) pairs from all pre viously generated tokens to a void recomputation during in- ference ( V aswani et al. 
, 2017). For long contexts, this KV cache becomes the dominant memory bottleneck, consuming tens of gigabytes for a single sequence (Kwon et al., 2023). KV cache compression is therefore essential for practical deployment.

KV cache compression broadly falls into two dimensions: reducing the size of each pair (quantization, low-rank) and reducing the number of pairs (sequence length). Along the sequence-length dimension, methods range from pure eviction to merging. Eviction selects k pairs to keep: criteria include attention score accumulation (Zhang et al., 2023), recency with attention sinks (Xiao et al., 2024), persistence of importance (Liu et al., 2023), and pyramidal allocation (Cai et al., 2024). Merging combines similar pairs, modifying values but remaining anchored to the original cache structure. A hybrid variant fits new values for the selected positions via least squares (Devoto et al., 2024), but the keys, and thus which regions of embedding space are represented, remain constrained to original cache entries.

We argue that this discrete selection framework is unnecessarily restrictive. After RoPE encoding (Su et al., 2024), KV pairs are vectors in a continuous embedding space with no inherent ordering; their positional information is already baked into the embeddings. There is no reason the compressed cache must be a subset of the original: any set of k vectors that reproduces the correct attention behavior is equally valid.

We propose KVSculpt, which reformulates KV cache compression as distillation (Hinton et al., 2015) into a smaller cache (Figure 1). Given a full KV cache of N pairs per layer, we optimize k unconstrained key-value pairs such that the attention output under the compressed cache matches the original.
Keys are optimized with L-BFGS (Liu and Nocedal, 1989), a quasi-Newton method well-suited to the smooth but non-convex attention landscape; values are solved analytically via ridge regression, given the attention weights induced by the current keys. Optimization is per-layer and per-head, enabling trivial parallelism and bounded memory.

Beyond the core optimizer, we introduce adaptive budget allocation: a short pilot compression run reveals the per-layer and per-head compression difficulty, which is then used to redistribute the fixed total budget. Layers and heads that are harder to compress receive more pairs; easy ones receive fewer. At inference time this is free (the same total budget, just redistributed), and the pilot cost is amortized into the compression step.

Our contributions:
1. We reformulate KV cache compression from discrete selection to distillation, eliminating the combinatorial search over positions.
2. We propose an L-BFGS + least-squares alternating optimizer that achieves 3.5–4.1× lower KL divergence than the best eviction baseline.
3. We introduce pilot-based adaptive allocation at the layer and head granularity, with per-layer redistribution alone yielding 1.3× lower KL at no extra inference cost.
4. We provide analysis showing that compression difficulty is highly structured, varying by orders of magnitude across layers and KV heads, and that per-layer errors compound through the transformer, identifying the bottleneck for future work.

2 Related Work

Sequence-length reduction. Eviction methods select a subset of KV pairs and discard the rest. H2O (Zhang et al., 2023) accumulates attention scores across queries and evicts the lowest-scoring pairs. StreamingLLM (Xiao et al., 2024) keeps an attention sink window plus recent tokens. ScissorHands (Liu et al., 2023) exploits the persistence of importance across decoding steps. PyramidKV (Cai et al.
, 2024) allocates different cache sizes per layer based on attention pattern structure. FastGen (Ge et al., 2024) uses attention profiling to decide per-head compression policies. Rather than discarding pairs entirely, merging methods combine similar ones: CaM (Zhang et al., 2024) merges low-importance pairs into neighboring important ones via weighted averaging, D2O (Wan et al., 2025) distinguishes active and passive tokens and merges the passive ones, and DMC (Nawrot et al., 2024) learns a per-head gate that decides whether to append or merge each incoming token. Both families remain anchored to the original cache entries: eviction keeps a subset unchanged, and merging produces weighted combinations of the originals.

KV cache quantization and low-rank methods. Orthogonal lines of work reduce memory by quantizing KV values to lower precision (Hooper et al., 2024) or exploiting low-rank structure in the cache. These approaches are complementary to ours: quantization can be applied on top of the distilled cache, and low-rank compression can be combined with pair reduction.

Adaptive per-layer budgets. PyramidKV (Cai et al., 2024) and PyramidInfer (Yang et al., 2024) allocate different cache sizes per layer based on attention entropy. Devoto et al. (2024) use the L2 norm of values as a compression difficulty signal. Our work shares the insight that uniform allocation is suboptimal, but differs in signal (dynamic pilot MSE vs. static attention patterns) and scope (we also allocate across KV heads within a layer).

3 Method

3.1 Problem Formulation

Consider a single KV head in an attention layer with h_q query heads and h_kv KV heads, where the head serves g = h_q / h_kv query heads (GQA group size).
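For the model used in our experiments, the grouping works out as follows (a trivial indexing sketch; the contiguous query-to-KV-head mapping is the standard GQA convention, assumed here):

```python
# GQA head grouping for Qwen2.5-1.5B-Instruct's attention configuration.
h_q, h_kv = 12, 2            # query heads, KV heads per layer
g = h_q // h_kv              # each KV head serves g = 6 query heads

# Standard contiguous mapping: query head q attends through KV head q // g.
kv_head_of = [q // g for q in range(h_q)]
assert kv_head_of == [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```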
After processing a context of N tokens, the head holds a full KV cache that we partition into a compress zone (the oldest N − m pairs) and a retain zone (the most recent m pairs, kept unchanged):

    K_full = [K_old; K_ret],  V_full = [V_old; V_ret]    (1)

where K_old, V_old ∈ R^((N−m)×d), K_ret, V_ret ∈ R^(m×d), so K_full, V_full ∈ R^(N×d), and all keys include RoPE positional encoding. The goal is to distill the compress zone into k freely optimized pairs (K_c, V_c) ∈ R^(k×d) such that the compressed cache

    K_cat = [K_c; K_ret],  V_cat = [V_c; V_ret]    (2)

preserves the attention output for any future query. The compression ratio is r = (k + m) / N.

Relationship to eviction and merging. Eviction constrains K_c to a k-element subset of rows of K_old; merging restricts each row to a weighted combination of original rows. Our feasible set contains both: any eviction or merge solution lies in R^(k×d), so in principle our optimum cannot be worse.

Figure 1: Three paradigms for KV cache sequence-length reduction. (a) Eviction selects a subset of original KV pairs. (b) Merge combines similar pairs, modifying values but remaining anchored to original positions. (c) KVSculpt (ours) distills the compress zone into k unconstrained pairs (orange) freely optimized in R^d via L-BFGS (keys) and least squares (values). All three keep the retain zone (green) intact. The key distinction is the degree of freedom: from discrete subset, to weighted combination, to fully continuous optimization.
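As a concrete sketch of the partition in Eqs. 1–2 (toy shapes and random data for illustration; the paper's setting is N = 2048, m = 256, d = 128 per KV head):

```python
import numpy as np

N, m, d = 32, 8, 4          # context length, retain-zone size, head dim
r = 0.5                     # target compression ratio
k = int(r * N) - m          # freely optimized pairs in the compress zone

K_full = np.random.randn(N, d)
V_full = np.random.randn(N, d)

# Partition (Eq. 1): oldest N - m pairs are compressed, newest m kept intact.
K_old, K_ret = K_full[:N - m], K_full[N - m:]
V_old, V_ret = V_full[:N - m], V_full[N - m:]

# The compressed zone is k unconstrained vectors, not a subset of K_old.
K_c = np.random.randn(k, d)   # to be optimized (L-BFGS)
V_c = np.random.randn(k, d)   # to be solved (least squares)

# Compressed cache (Eq. 2).
K_cat = np.concatenate([K_c, K_ret])
V_cat = np.concatenate([V_c, V_ret])
assert K_cat.shape == (k + m, d)
assert (k + m) / N == r       # compression ratio as defined above
```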
More broadly, our formulation is a direct answer to the sequence-length reduction problem itself: given a budget of k pairs, find the k pairs in R^(k×d) that best preserve attention behavior, with no structural constraints. Since RoPE already encodes position into the key embedding, K_c is free to land at "virtual" positions that need not correspond to any original token. In practice, the non-convexity of the softmax landscape means the global optimum is not guaranteed, but L-BFGS with warm-start initialization consistently finds solutions that outperform both eviction and merging (Section 5).

3.2 Loss Function

We optimize K_c and V_c to match the full-cache attention output for a set of training queries Q ∈ R^(g×n_q×d) (construction detailed in Section 3.3). Following the chunked attention decomposition used in FlashAttention (Dao et al., 2022), the context chunk's contribution to any future decode step is fully determined by the partial output o and the log-sum-exp ℓ = log Σ exp(scores), which subsumes the max score µ. As long as these match between the compressed and full cache, the final combined output is identical regardless of the future decode chunk. This motivates a two-term loss:

    L = ||Ŷ − Y||_F²  +  ||ℓ̂ − ℓ||_F²    (3)

where the first term is the output MSE and the second is LSE matching; Y = softmax(Q K_full^T / √d) V_full is the full-cache output, Ŷ is the compressed-cache output, ℓ = LSE(Q K_full^T / √d) is the log-sum-exp of the full-cache scores, and ℓ̂ is the compressed counterpart. The LSE term ensures that the attention mass assigned to the context chunk is correct, which is critical when the context chunk is later combined with future decode tokens. We weight both terms equally (λ = 1); in practice they are comparable in magnitude because both are normalized by the number of queries.

3.3 Training Query Construction

The loss in Eq.
3 requires a set of training queries, but the actual future decode queries are unavailable at compression time. A natural proxy is the retain queries, the m real queries from the retain zone, since they are the most recent and thus closest to future decode queries. However, retain queries carry RoPE at positions [N − m, ..., N − 1], while future decode queries will have positions [N, N + 1, ...]; this positional mismatch can bias the optimization.

De-RoPE factorization. Since q = RoPE(q_c, p), where q_c = W_q h is a position-independent content vector, we can recover q_c = RoPE^(−1)(q, p) by inverting the rotation. Empirically, q_c is approximately stationary: consecutive content vectors have cosine similarity 0.91–0.93, and a PCA basis fitted on context tokens captures 81–86% of decode variance, with only ~2% decay per 2048 tokens. The effective dimensionality is ~60 out of 128, indicating concentrated low-dimensional structure.

Synthetic future queries. We uniformly subsample n_s content vectors across the full context, then re-apply RoPE at future positions N, N + 1, ..., N + n_s − 1:

    Q_synth = RoPE(q_c[uniform_indices], [N, ..., N + n_s − 1])    (4)

Uniform sampling provides broad temporal coverage of the content distribution. We tested alternative strategies (bootstrap from recent tokens, k-means centroids, farthest-point sampling, PCA extrapolation with subspace rotation); none improved over uniform, because the distributional drift signals are too small in magnitude to reliably exploit from context alone (Section 5.3).

Final training set. The training queries are the union of all m retain queries (at their original positions) and n_s synthetic future queries, giving n_q = m + n_s total queries per query head (Q ∈ R^(g×n_q×d) in Eq. 3).
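The de-RoPE inversion and the re-encoding of Eq. 4 can be sketched with a minimal rotary implementation (interleaved-pair convention with complex rotation; real models may use the rotate-half layout instead, and all sizes here are toy assumptions):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position encoding at integer positions `pos` (shape [n])."""
    n, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)          # [d/2] rotation rates
    ang = np.asarray(pos)[:, None] * freqs[None, :]    # [n, d/2] angles
    z = x[:, 0::2] + 1j * x[:, 1::2]                   # pair dims as complex
    z = z * np.exp(1j * ang)                           # rotate each pair
    out = np.empty_like(x)
    out[:, 0::2], out[:, 1::2] = z.real, z.imag
    return out

def de_rope(q, pos):
    """Invert RoPE: rotating by -pos recovers the content vector q_c."""
    return rope(q, -np.asarray(pos))

N, n_s, d = 64, 8, 8
q_c_true = np.random.randn(N, d)                    # content vectors W_q h
q_ctx = rope(q_c_true, np.arange(N))                # queries as observed
q_c = de_rope(q_ctx, np.arange(N))                  # de-RoPE recovers content
assert np.allclose(q_c, q_c_true)

# Eq. 4: uniform subsample, re-encoded at future positions N, ..., N + n_s - 1.
idx = np.linspace(0, N - 1, n_s).astype(int)
Q_synth = rope(q_c[idx], np.arange(N, N + n_s))
```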
The retain queries anchor the optimization to real attention patterns, while the synthetic queries improve generalization to future positions.

3.4 Optimization

The loss in Eq. 3 is differentiable w.r.t. K_c (through the softmax) but has a favorable structure for V_c: given fixed attention weights, the optimal V_c is a linear least-squares solution.

Alternating K-optimization and V-solve. We alternate between:
1. K step: update K_c via L-BFGS (Liu and Nocedal, 1989) with V_c frozen. L-BFGS uses curvature information from gradient history, which is critical for navigating the non-convex softmax landscape; first-order methods (Adam) get trapped in poor local minima (Section 5.2).
2. V step (every 5 K steps): solve

    V_c* = argmin_{V_c} ||A_c V_c + A_r V_ret − Y||_F² + λ_r ||V_c||_F²

via ridge regression (λ_r = 1e-3), where A = softmax(Q K_cat^T / √d) ∈ R^(n_q×(k+m)) is partitioned as [A_c; A_r] over the compressed and retained key positions.

Initialization. We initialize K_c from the top-k positions by attention importance score (sum of softmax attention weights across all queries), which provides a warm start near a good basin.

Per-head independence. With grouped-query attention (Ainslie et al., 2023), each KV head serves a group of query heads independently. The loss decomposes across KV heads, so we optimize each head separately. This enables per-head budget allocation (Section 3.5).

3.5 Adaptive Budget Allocation

The standard approach allocates the same number of compressed pairs k to every layer and head. However, compression difficulty varies dramatically across components.

Pilot-MSE signal. We run a short pilot compression with uniform allocation to obtain a per-component MSE signal (60 L-BFGS steps for per-layer allocation; 30 steps per head for per-head allocation).
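The alternating scheme of Section 3.4 can be sketched on a toy single-head problem. This is a minimal illustration under several simplifying assumptions: scipy's L-BFGS-B stands in for L-BFGS with strong Wolfe line search, there is no retain zone, and a truncation warm start replaces attention-score initialization:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_q, N, k, d = 16, 32, 8, 4                   # queries, full pairs, budget, dim
Q = rng.standard_normal((n_q, d))
K_full = rng.standard_normal((N, d))
V_full = rng.standard_normal((N, d))

def attn(K, V):
    """Softmax attention output and per-query log-sum-exp over cache (K, V)."""
    s = Q @ K.T / np.sqrt(d)
    lse = np.log(np.exp(s - s.max(1, keepdims=True)).sum(1)) + s.max(1)
    return np.exp(s - lse[:, None]) @ V, lse

Y, lse_full = attn(K_full, V_full)            # full-cache targets

def loss(flat_K, V_c):
    """Eq. 3: output MSE plus LSE matching, as a function of the flat keys."""
    Yh, lse_c = attn(flat_K.reshape(k, d), V_c)
    return ((Yh - Y) ** 2).sum() + ((lse_c - lse_full) ** 2).sum()

K_c, V_c = K_full[:k].copy(), V_full[:k].copy()   # warm start by truncation
for outer in range(10):
    # K step: L-BFGS on the keys with the values frozen.
    res = minimize(loss, K_c.ravel(), args=(V_c,), method="L-BFGS-B",
                   options={"maxiter": 10})
    K_c = res.x.reshape(k, d)
    # V step: ridge regression given the attention weights induced by K_c.
    A = np.exp(Q @ K_c.T / np.sqrt(d))
    A /= A.sum(1, keepdims=True)
    V_c = np.linalg.solve(A.T @ A + 1e-3 * np.eye(k), A.T @ Y)

# Distilled pairs beat the truncation baseline they started from.
assert loss(K_c.ravel(), V_c) < loss(K_full[:k].ravel(), V_full[:k])
```

Per-layer, per-head independence means one such problem is solved for every KV head of every layer, which is what makes the pilot-based allocation below cheap to measure.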
This signal captures the sequence-specific compression difficulty of each layer and head, unlike static signals such as value norm or attention entropy, which are model properties independent of the input.

Two-level allocation. Given a fixed total budget B = k × L × h_kv (where L is the number of layers):
1. Per-layer: compute the mean pilot MSE across heads for each layer; allocate layer budgets proportional to MSE_l^0.5 (square-root dampening prevents outlier layers from consuming the entire budget).
2. Per-head within layer: given the layer budget, redistribute across KV heads proportional to MSE_{l,h}^0.5.

The square-root dampening is critical: undampened MSE (α = 1.0) causes over-allocation to outlier layers (e.g., Layer 0, whose MSE is 10–100× the median), starving other layers and worsening overall quality.

4 Experimental Setup

Model and data. We use Qwen2.5-1.5B-Instruct (Qwen Team, 2025), a 28-layer model with grouped-query attention (12 query heads, 2 KV heads, head dimension 128). We sample sequences from the PG19 test set with context length N = 2048 and evaluate on 128 continuation tokens. Baseline experiments use 20 sequences; detailed analysis and ablations use a 5-sequence subset.

Evaluation metric. We measure KL divergence between the output logits under the compressed KV cache and the ground-truth logits under the full cache, computed over the 128 continuation tokens with teacher forcing. Lower KL indicates better preservation of the model's output distribution.

Compression ratios. We test r ∈ {0.1, 0.2, 0.3, 0.5, 0.7}. The retain zone is m = 256 tokens (the most recent 12.5% of the context).

Baselines. We compare against four methods:
• Random: randomly select k positions to keep.
• Attention Score: keep the top-k positions by accumulated attention score across all queries (Zhang et al., 2023).
• Select+Fit: select positions by attention score, then fit values via ridge regression. Inspired by Devoto et al. (2024), who use the L2 norm for selection; we substitute attention-score selection for a stronger baseline.
• Joint Optimization: learn binary gates via Hard Concrete relaxation (Louizos et al., 2018) jointly with KV values; the discrete selection analogue of our method.

KVSculpt configuration. L-BFGS with learning rate 0.5, strong Wolfe line search, 10 inner iterations per step; 100 outer steps per layer; V solved via ridge regression (λ = 1e-3) every 5 steps; 128 synthetic future queries; retain zone m = 256.

5 Results

5.1 Main Results: Distillation vs. Eviction

Table 1 compares KVSculpt against baselines across compression ratios. KVSculpt achieves 3.5–4.1× lower KL than Select+Fit, the strongest eviction baseline, across all tested ratios. The advantage is most pronounced at aggressive compression (r = 0.3, 4.1×), where the discrete selection problem is hardest.

Table 1: KL divergence (↓) on 5 sequences (mean). KVSculpt consistently outperforms all eviction baselines by a wide margin.

Method           r=0.3     r=0.5     r=0.7
Random           2.50      1.80      1.54
Attn Score       2.54e-1   2.04e-1   1.39e-1
Select+Fit       2.33e-1   1.86e-1   1.25e-1
Joint Opt        2.24e-1   1.80e-1   1.15e-1
KVSculpt         5.75e-2   4.63e-2   3.58e-2
vs. Select+Fit   4.1×      4.0×      3.5×

The role of continuous keys. Joint Optimization uses the same per-layer MSE objective as KVSculpt and also optimizes values, but constrains keys to original cache positions via Hard Concrete gates. It barely improves over Select+Fit (2.24e-1 vs. 2.33e-1 at r=0.3), showing that optimizing values alone is insufficient: the key advantage of KVSculpt is moving keys freely in R^d, not just fitting better values.

Per-sequence variation. We label sequences with KVSculpt KL < 0.01 at r=0.3 as easy (3 of 5) and the rest as hard. Select+Fit remains at KL > 0.1 for all sequences, including the easy ones. On harder sequences (Seq 1, Seq 2), the advantage narrows to 1.2–1.8× but never reverses (Table 2).

Table 2: Per-sequence KL at r=0.3. KVSculpt achieves near-lossless compression on easy sequences and remains superior on hard ones.

Seq        Select+Fit   KVSculpt   Improvement
0 (easy)   1.28e-1      8.1e-3     15.8×
1 (hard)   1.39e-1      1.17e-1    1.2×
2 (hard)   2.61e-1      1.43e-1    1.8×
3 (easy)   3.92e-1      9.7e-3     40.5×
4 (easy)   2.46e-1      9.6e-3     25.5×
Mean       2.33e-1      5.75e-2    4.1×

5.2 L-BFGS vs. Adam and Near-Optimality

The choice of optimizer is critical. Replacing L-BFGS with Adam (the standard first-order optimizer for neural network training) degrades per-layer MSE by 17–95× (Table 3). This translates to 8–15× worse end-to-end KL on easy sequences. The softmax in the attention computation creates a loss landscape with sharp, narrow valleys. L-BFGS's curvature approximation navigates these efficiently, while Adam's diagonal preconditioning is insufficient.

Table 3: Per-layer MSE (r=0.3). L-BFGS (1 restart) is 17–95× better than Adam and within 2–8% of the 100-restart oracle, indicating near-optimal per-layer solutions.

Layer   Adam      L-BFGS    Oracle    Gap
L2      1.78e-3   1.05e-4   9.67e-5   8%
L14     7.97e-2   1.28e-3   1.20e-3   7%
L26     1.14e-1   1.20e-3   1.18e-3   2%

Near-optimality. To assess how close L-BFGS gets to the best achievable per-layer solution, we run an oracle search: 100 random restarts with the best result taken as an empirical upper bound. Table 3 shows that a single L-BFGS run (1 restart) already reaches within 2–8% of the 100-restart oracle. This near-optimality has an important implication: the per-layer optimization problem is essentially solved. The remaining gap to perfect end-to-end quality is not due to suboptimal per-layer compression, but to error accumulation across layers, a limitation of the per-layer objective itself (Section 6).

5.3 Query Strategy Ablation

We compare five strategies for constructing the synthetic future queries described in Section 3.3, evaluated on attention cosine similarity at near (128 tokens) and far (4096 tokens) horizons (Table 4).

Table 4: Query sampling strategies. Uniform spread provides the best overall trade-off, especially at far horizons.

Strategy               Near attn cos   Far attn cos
Bootstrap (last 128)   0.950           0.737
Uniform spread         0.948           0.759
Random sample          0.948           0.756
k-means centroids      0.947           0.756
Farthest-point         0.925           0.758

Bootstrap (sampling recent tokens only) wins at the near horizon but degrades at far horizons because the content vector distribution shifts slightly over time. Uniform spread sacrifices 0.2% near-term accuracy for 3% far-term improvement by covering the full temporal span of the context. Advanced strategies (PCA extrapolation, subspace rotation prediction, norm correction) were tested but none improved over uniform: the distributional drift signals (~2% subspace rotation per 2048 tokens, 5% norm growth) are too small to reliably exploit.

5.4 Adaptive Budget Allocation

The preceding sections establish the core optimizer; we now turn to how to distribute the compression budget. Since optimization is per-head, we can measure each component's difficulty and allocate accordingly.

Static signals fail across sequences. We first test allocation based on value norm (||V||), a static model property. While ||V||^0.5-weighted allocation improves Seq 1 by 25%, it worsens Seq 2 by 73% (Table 5). The mean KL across 5 sequences is 27% worse than uniform. The problem is that ||V|| is a model property invariant across sequences, but compression difficulty is sequence-dependent.

Pilot-MSE allocation.
A 60-step pilot run at uniform allocation produces a per-layer MSE signal that is sequence-specific. Allocating proportional to MSE^0.5 wins on 4/5 sequences and reduces mean KL by 25% (Table 6). The improvement is largest on hard sequences (Seq 1: 0.60×) and slightly negative on the easiest sequence (Seq 0: 1.10×), where the baseline KL is already near zero and reallocation adds noise.

Table 5: Static allocation (||V||^0.5) is unstable across sequences. Multipliers are relative to uniform allocation (r=0.3).

Seq    Uniform KL   ||V||^0.5 vs. Uni.
0      8.1e-3       1.05×
1      1.17e-1      0.75×
2      1.43e-1      1.73×
3      9.7e-3       0.84×
4      9.6e-3       1.23×
Mean   5.75e-2      1.27×

Table 6: Pilot-MSE^0.5 allocation wins 4/5 sequences with 25% mean KL reduction (r=0.3).

Seq    Uniform    Pilot-MSE^0.5   vs. Uniform
0      8.14e-3    8.93e-3         1.10×
1      1.170e-1   7.01e-2         0.60×
2      1.433e-1   1.208e-1        0.84×
3      9.67e-3    8.64e-3         0.89×
4      9.59e-3    7.77e-3         0.81×
Mean   5.75e-2    4.31e-2         0.75×

Square-root dampening. Undampened allocation (MSE^1.0) is too aggressive: it gives Layer 0 (which has 10–100× the median MSE) an extreme share of the budget, starving other layers. At α = 1.0, mean KL is only 0.97× uniform (barely improved), while α = 0.5 achieves 0.75×.

5.5 Per-Head Allocation

Per-layer allocation treats each layer as a unit, but with grouped-query attention, the model has 2 KV heads per layer. These heads serve qualitatively different functions: Layer 15 is consistently the most asymmetric across all sequences (17–56× head MSE ratio), with Head 0 easy to compress and Head 1 hard; Layer 0 shows the opposite pattern with even larger ratios (up to 467×). This structural asymmetry persists across all tested sequences, motivating per-head budget redistribution. Quantitatively, the per-head pilot MSE ratio averages 3.7–21.5× across layers (Table 7).

Table 7: Head MSE asymmetry: mean and max ratio of per-head pilot MSE within the same layer (r=0.3).

Seq   Mean ratio   Max ratio   Layer
0     21.5×        467×        L0
1     5.5×         31×         L15
2     6.1×         33×         L15
3     3.7×         17×         L15
4     6.2×         56×         L15

Table 8 isolates the effect of per-head redistribution. Here, "per-layer" uses a 30-step per-head pilot averaged to layer level (noisier than the 60-step joint pilot in Table 6, hence the weaker per-layer numbers). Adding per-head allocation on top consistently improves KL, by 1–58% depending on the sequence; the per-layer+per-head strategy wins 4 of 5 sequences.

Table 8: Per-head allocation always improves over per-layer alone. All numbers relative to uniform (r=0.3).

Seq    Per-layer   Per-layer+head   Δ
0      0.82×       0.81×            −1%
1      1.37×       1.15×            −16%
2      0.88×       0.84×            −5%
3      0.87×       0.86×            −1%
4      1.77×       0.75×            −58%
Mean   1.05×       0.92×            −8%

6 Analysis: Limits of Per-Layer Optimization

The results above show that KVSculpt nearly solves the per-layer optimization problem (within 2–8% of the oracle). Yet hard sequences still exhibit high KL. We now investigate why: the bottleneck is not per-layer quality, but how errors propagate across layers and concentrate in specific tokens.

6.1 Error Accumulation Across Layers

Per-layer optimization does not guarantee end-to-end quality because errors compound through the transformer. We measure hidden-state MSE at each layer during decoding with compressed KV and find that errors compound by two to three orders of magnitude from the first to the last layer (220× for easy sequences, up to 5800× for hard ones). Crucially, per-layer compression MSE and end-to-end KL are only weakly correlated. Sequence 0 has 2.4× higher per-layer MSE than Sequence 1, yet Sequence 1 has 14× higher end-to-end KL.
Some sequences are structurally more sensitive to small perturbations, an effect invisible to the per-layer objective.

6.2 Per-Token KL Concentration

KL divergence is not spread uniformly across continuation tokens. On the hardest sequence at r = 0.3, 82% of the total KL is concentrated in just 5 of 128 tokens. The maximum per-token KL is 7.17, while the mean is 0.117 (a 61× ratio). These "sensitive" tokens tend to occur at high-entropy decision points (e.g., the first token of a new clause), where the softmax distribution is flat and small logit perturbations cause large probability shifts.

7 Discussion

Why does continuous optimization help so much? As shown in Section 5.1, optimizing values while keeping keys at original positions (Joint Optimization) barely helps. The decisive factor is key freedom: a distilled key can "summarize" multiple original keys by positioning itself in embedding space such that its attention pattern approximates their combined effect, a representation that no single original token contains. This is why the eviction-to-distillation gap (4×) far exceeds the eviction-to-joint-optimization gap (1.04×).

Deployment scenario. KVSculpt is designed for the offline setting: given a long context that will be queried many times (e.g., a document, a system prompt, a retrieved passage), the compression cost is amortized over many decode steps. At 100 L-BFGS steps per layer, compression takes ~170 s for a 2048-token context on a single A100 GPU.

The next bottleneck. Our analysis shows that the per-layer problem is essentially solved (within 2–8% of the oracle), yet hard sequences remain far from lossless. The bottleneck is cross-layer error propagation: small per-layer perturbations compound by two to three orders of magnitude. Cascade-aware optimization, where each layer's target accounts for upstream compression error, is a natural next step, and the per-head decomposition makes this tractable.
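The per-token KL concentration in Section 6.2 corresponds to a computation of this shape (synthetic logits for illustration, not the paper's measurements):

```python
import numpy as np

def per_token_kl(logits_full, logits_comp):
    """KL(full || compressed) per continuation token, from raw logits."""
    def log_softmax(x):
        x = x - x.max(-1, keepdims=True)
        return x - np.log(np.exp(x).sum(-1, keepdims=True))
    lp, lq = log_softmax(logits_full), log_softmax(logits_comp)
    return (np.exp(lp) * (lp - lq)).sum(-1)            # shape [n_tokens]

# Synthetic illustration: perturb a few "sensitive" tokens much more strongly.
rng = np.random.default_rng(0)
T, V = 128, 1000                                       # tokens, vocab size
full = rng.standard_normal((T, V))
noise = 0.05 * rng.standard_normal((T, V))
noise[:5] *= 40                                        # 5 high-error tokens
kl = per_token_kl(full, full + noise)

# A handful of tokens carry most of the total KL mass.
top5_share = np.sort(kl)[-5:].sum() / kl.sum()
assert top5_share > 0.5
```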
8 Conclusion

We introduced KVSculpt, which reframes KV cache compression from discrete token selection to continuous distillation. The key insight is that after RoPE encoding, KV pairs are free vectors in R^d; there is no reason the compressed cache must be a subset of the original. By optimizing keys with L-BFGS and solving values in closed form, KVSculpt achieves near-lossless compression on easy sequences (KL < 0.01) at ratios where eviction baselines remain an order of magnitude worse. Adaptive budget allocation, guided by a cheap pilot signal, provides a further 1.3× reduction by exploiting the extreme non-uniformity of compression difficulty across layers and heads. Our analysis identifies cross-layer error propagation as the remaining bottleneck, suggesting cascade-aware optimization as a promising direction.

Limitations

• Single model: All experiments use Qwen2.5-1.5B-Instruct. Validation on other architectures (Llama, Mistral) and scales (7B+) is needed.
• Fixed context length: We test only N = 2048. Long-context scenarios (8K–128K) may exhibit different compression dynamics.
• Compression cost: At ~170 s per context on an A100, KVSculpt is suitable for offline/amortized settings but not for online single-pass decoding.
• KL-only evaluation: We measure logit KL divergence but do not evaluate downstream task accuracy (e.g., MMLU, summarization quality).
• No merge baselines: We compare against eviction methods but not merging approaches (CaM, D2O, DMC). While our formulation theoretically subsumes merging, empirical comparison would strengthen the claim.
• No comparison with quantization: KV quantization (Hooper et al., 2024) is complementary and could be combined with KVSculpt, but this is not explored.
• Per-layer optimization: Errors accumulate across layers in ways not captured by the per-layer objective.
Global or cascade-aware optimization could improve hard sequences.

References

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. 2024. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069.

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems.

Alessio Devoto, Yu Zhao, Simone Scardapane, and Pasquale Minervini. 2024. A simple and effective L2 norm-based strategy for KV cache compression. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. 2024. Model tells you what to discard: Adaptive KV cache compression for LLMs. In International Conference on Learning Representations.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. 2024. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. In Advances in Neural Information Processing Systems.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention.
In Pr oceedings of the A CM SIGOPS 29th Symposium on Operating Systems Principles . Dong C. Liu and Jor ge Nocedal. 1989. On the limited memory BFGS method for lar ge scale optimization. Mathematical Pr ogramming , 45:503–528. Zichang Liu, Aditya Desai, Fangshuo Liao, W eitao W ang, V ictor Xie, Zhaozhuo Xu, Anastasios K yril- lidis, and Anshumali Shriv astav a. 2023. Scis- sorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. In Advances in Neur al Information Pr ocessing Systems . Christos Louizos, Max W elling, and Diederik P . Kingma. 2018. Learning sparse neural networks through L 0 regularization. In International Confer ence on Learn- ing Repr esentations . Piotr Nawrot, Adrian Ła ´ ncucki, Marcin Chocho wski, David T arjan, and Edoardo M. Ponti. 2024. Dynamic memory compression: Retrofitting LLMs for accel- erated inference. In Pr oceedings of the 41st Interna- tional Confer ence on Machine Learning . 8 Qwen T eam. 2025. Qwen2.5 technical report. arXiv pr eprint arXiv:2412.15115 . Jianlin Su, Murtadha Ahmed, Y u Lu, Shengfeng Pan, W en Bo, and Y unfeng Liu. 2024. RoFormer: En- hanced transformer with rotary position embedding. Neur ocomputing , 568:127063. Ashish V aswani, Noam Shazeer , Niki Parmar , Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser , and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Pr ocess- ing Systems , 30. Zhongwei W an, Xinjian W u, Y u Zhang, Y i Xin, Chaofan T ao, Zhihong Zhu, Xin W ang, Siqi Luo, Jing Xiong, Longyue W ang, and Mi Zhang. 2025. D2O: Dynamic discriminativ e operations for ef ficient long-context inference of large language models. In International Confer ence on Learning Representations . Guangxuan Xiao, Y uandong Tian, Beidi Chen, Song Han, and Mike Le wis. 2024. Ef ficient streaming lan- guage models with attention sinks. In International Confer ence on Learning Representations . 
Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. 2024. PyramidInfer: Pyramid KV cache compression for high-throughput LLM inference. In Findings of the Association for Computational Linguistics: ACL 2024.

Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. 2024. CaM: Cache merging for memory-efficient LLMs inference. In Proceedings of the 41st International Conference on Machine Learning.

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. 2023. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems.

A Full Baseline Results (20 Sequences)

Table 9 reports KL divergence across all 20 sequences and 5 compression ratios for the baseline methods.

Table 9: KL divergence (mean ± std, 20 sequences) at representative ratios. Joint Opt marginally improves over Select+Fit; both are far behind KVSCULPT.

Method       r=0.1          r=0.3          r=0.5          r=0.7
Random       3.43 ± 1.21    2.60 ± 0.45    2.06 ± 0.62    1.83 ± 0.57
Attn Score   0.322 ± 0.105  0.198 ± 0.084  0.137 ± 0.065  0.086 ± 0.056
Select+Fit   0.322 ± 0.107  0.185 ± 0.076  0.129 ± 0.057  0.080 ± 0.050
Joint Opt    0.311 ± 0.113  0.182 ± 0.073  0.121 ± 0.058  0.073 ± 0.047

B Dampening Exponent Sensitivity

Table 10 shows the effect of the dampening exponent α in ∥V∥^α-weighted allocation on Seq 1 (r=0.3). The relationship is non-monotonic: α=0.5 is optimal. Pilot-MSE allocation shows the same dampening pattern (0.97× at α=1.0, 0.75× at α=0.5, mean over 5 sequences).

Table 10: Dampening exponent α for ∥V∥^α per-layer allocation (Seq 1, r=0.3). α=0.5 is optimal; higher values over-allocate to outlier layers.

α            KL         vs. Uniform   k_min   k_max
0 (uniform)  1.170e-1   ref           358     358
0.3          1.175e-1   1.00×         277     480
0.5          8.83e-2    0.75×         231     578
0.7          1.328e-1   1.14×         191     689
1.0          9.27e-2    0.79×         140     880
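As a concrete illustration of the dampened allocation studied above, the following sketch (our own, not the paper's implementation) distributes a total per-head KV budget across layers in proportion to score^α. The difficulty scores and the total budget are hypothetical stand-ins for pilot MSE or value norms; α=0 recovers uniform allocation, while larger α tracks the raw scores more aggressively.

```python
import numpy as np

def allocate_budget(scores, total, alpha=0.5, k_min=1):
    """Allocate `total` compressed KV slots across layers as score**alpha.

    `scores` are per-layer difficulty estimates (hypothetical here, e.g.
    pilot MSE or value norms). Assumes the k_min floor does not push the
    allocation sum above `total`.
    """
    w = np.asarray(scores, dtype=float) ** alpha
    k = np.maximum(k_min, np.floor(total * w / w.sum()).astype(int))
    # Hand leftover slots (lost to flooring) to the highest-weight layers.
    leftover = total - k.sum()
    for i in np.argsort(-w)[:max(leftover, 0)]:
        k[i] += 1
    return k

scores = [1.0, 4.0, 100.0, 0.5]                        # up to ~100x spread
print(allocate_budget(scores, total=1432, alpha=0.5))
print(allocate_budget(scores, total=1432, alpha=0.0))  # uniform: 358 each
```

The dampening exponent trades off responsiveness against robustness: at α=1 a single outlier layer can absorb most of the budget, which is why intermediate values like α=0.5 behave better in Table 10.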
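For completeness, the alternating scheme summarized in Section 8 (keys optimized with L-BFGS, values solved in closed form by least squares) can be sketched as follows. This is an illustrative toy reimplementation under simplifying assumptions, not the authors' code: a single head, no RoPE, and arbitrary shapes N, M, d and iteration counts.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, M, d = 64, 16, 8                      # original pairs, compressed budget, head dim
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Target: the original attention output at every query position.
T = softmax(Q @ K.T / np.sqrt(d)) @ V

def solve_values(Kc):
    """Closed-form value fit given fixed compressed keys (least squares)."""
    A = softmax(Q @ Kc.T / np.sqrt(d))   # N x M compressed attention map
    Vc, *_ = np.linalg.lstsq(A, T, rcond=None)
    return Vc

def key_loss(kflat, Vc):
    A = softmax(Q @ kflat.reshape(M, d).T / np.sqrt(d))
    return np.sum((A @ Vc - T) ** 2)

Kc = K[rng.choice(N, size=M, replace=False)].copy()  # init from real keys
init = key_loss(Kc.ravel(), solve_values(Kc))
for _ in range(3):                       # alternate every few steps
    Vc = solve_values(Kc)
    res = minimize(key_loss, Kc.ravel(), args=(Vc,),
                   method="L-BFGS-B", options={"maxiter": 25})
    Kc = res.x.reshape(M, d)
final = key_loss(Kc.ravel(), solve_values(Kc))
print(f"attention-output MSE: {init:.4f} -> {final:.4f}")
```

Because each half-step (value solve, key descent) cannot increase the objective from its starting point, the loss is non-increasing across rounds; the compressed pairs (Kc, Vc) are free vectors, not a subset of the originals.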
