CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation
Large Language Models (LLMs) in long-context scenarios are severely constrained by the linear growth of Key-Value (KV) cache memory. Existing KV compression methods rely either on static thresholds and attention-only heuristics or on coarse memory-budget allocation. Under tight memory budgets, these methods overlook two key factors: prompt-dependent variation in compression risk and functional heterogeneity across attention heads. Both destabilize token selection and lead to tail failures. To address these challenges, we propose CompilerKV, a risk-adaptive, head-aware compression framework that compiles offline experience into reusable decision tables for prefill-only deployment. CompilerKV integrates two synergistic components: (i) a Head Heterogeneity Table, learned via offline contextual bandits, which assigns head-specific reliability weights to explicitly account for functional differences across attention heads; and (ii) a Risk-Adaptive Threshold Gating mechanism that jointly models attention entropy and local perplexity, transforming prompt-level risk into deployable retention thresholds. Experiments on LongBench show that CompilerKV outperforms state-of-the-art methods under a 512-token budget, recovering 97.7% of FullKV performance and gaining up to +5.2 points over the strongest competitor.
💡 Research Summary
Large language models (LLMs) excel at long‑context tasks such as summarization and multi‑hop reasoning, but their key‑value (KV) cache grows linearly with sequence length, quickly exhausting GPU memory. Existing KV compression techniques either use static sparsity patterns (e.g., sliding windows, fixed sinks) or dynamic importance scores (e.g., Heavy‑Hitter Oracle, SnapKV, PyramidKV). While these methods improve memory efficiency, they suffer from two critical shortcomings when the memory budget is tight and the compression must be performed before generation (prefill‑only): (1) they apply a prompt‑agnostic compression rate, ignoring that some prompts are intrinsically high‑risk (high entropy, complex semantics) and thus more vulnerable to information loss; (2) they treat all attention heads equally, despite mounting evidence that a small subset of heads drives long‑context capabilities while many others are noisy or redundant. Consequently, token eviction decisions become unstable, leading to “tail failures” where a few discarded tokens cause catastrophic degradation.
CompilerKV reframes prefill‑only KV compression as a one‑shot, irreversible decision problem under a strict token budget per layer. The authors model this as a contextual bandit: the context is the prompt and layer‑budget information, the action is a discrete compression choice, and the reward balances compression fidelity against budget violations. They then compile offline experience into two static lookup tables that are consulted at inference time, eliminating any online overhead.
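The reward trade-off described above can be sketched as follows. The fidelity metric and the linear overshoot penalty are illustrative assumptions, since the summary does not spell out the exact reward formulation:

```python
# Hypothetical sketch of the contextual-bandit reward described above.
# `fidelity` stands in for a compression-fidelity measure; the linear
# budget-violation penalty is an assumption, not the paper's formula.

def bandit_reward(fidelity: float, tokens_kept: int, budget: int,
                  penalty: float = 1.0) -> float:
    """Reward balances compression fidelity against budget violations:
    staying within the per-layer token budget incurs no penalty."""
    overshoot = max(0, tokens_kept - budget)
    return fidelity - penalty * (overshoot / budget)

# Example: an in-budget action keeps its full fidelity as reward.
r = bandit_reward(fidelity=0.95, tokens_kept=480, budget=512)
```

Because the decision is one-shot and irreversible, the offline agent only ever observes this reward during compilation; at inference time no reward is computed at all.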
The first component, the Head Heterogeneity Table (HHT), is learned via offline contextual bandits with Conservative Q‑Learning (CQL). For each layer‑head pair (l, h) the state includes the layer index, head index, and the layer’s token budget. The learned weight wₗ,ₕ acts as a reliability multiplier for that head’s utility score, effectively down‑weighting noisy heads and amplifying the contribution of retrieval‑oriented heads. Theoretical analysis (Theorem 4.1) shows that incorporating these weights yields a provable bound on the attention approximation error, demonstrating why head‑aware weighting mitigates tail failures.
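At inference time, consulting the HHT reduces to a constant-time table lookup. A minimal sketch, assuming the table is stored as a dictionary keyed by (layer, head) and filled with illustrative weights (the paper learns the actual weights offline with CQL):

```python
import numpy as np

# Toy Head Heterogeneity Table: w_{l,h} reliability weights.
# The dict layout and the numeric values are illustrative assumptions.
HHT = {(0, 0): 1.3, (0, 1): 0.4, (1, 0): 0.9, (1, 1): 1.1}

def weighted_scores(utility: np.ndarray, layer: int) -> np.ndarray:
    """Scale per-head utility scores u_{l,h}(t) by reliability weights w_{l,h}.

    utility: array of shape (num_heads, seq_len) for one layer.
    Returns s_{l,h}(t) = w_{l,h} * u_{l,h}(t).
    """
    w = np.array([HHT[(layer, h)] for h in range(utility.shape[0])])
    return w[:, None] * utility
```

Down-weighting a noisy head (here head 1 of layer 0, w = 0.4) makes its tokens less likely to crowd out tokens favored by retrieval-oriented heads.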
The second component, Risk‑Adaptive Threshold Gating (RATG), quantifies prompt‑level risk using two complementary statistics: (i) attention entropy H, measuring how diffuse the attention distribution is across tokens, and (ii) local perplexity P, reflecting predictive uncertainty of the model on the prompt. A linear combination R = λ₁·H + λ₂·P produces a scalar risk score. This score is mapped through a pre‑computed LUT to a retention threshold τ(R), which determines how many tokens should be kept. By adapting τ to the prompt’s difficulty, the system retains more tokens for high‑risk inputs and prunes aggressively for low‑risk ones.
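The risk scoring and threshold lookup can be sketched as below. The λ values, the LUT bin edges, and the τ levels are illustrative assumptions; the paper pre-computes its LUT offline:

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy H of an attention distribution over tokens."""
    p = attn / attn.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def risk_score(H: float, P: float, lam1: float = 0.5, lam2: float = 0.5) -> float:
    """Scalar risk R = lambda1*H + lambda2*P (lambda values are assumptions)."""
    return lam1 * H + lam2 * P

# Hypothetical retention-threshold LUT: higher risk maps to a LOWER
# threshold tau(R), so more tokens are retained for risky prompts.
RISK_BINS = np.array([1.0, 2.0, 3.0])          # bin edges (assumed)
TAU_LUT   = np.array([0.80, 0.60, 0.40, 0.20])  # tau per risk bin (assumed)

def retention_threshold(R: float) -> float:
    """Map a risk score to a retention threshold via the precomputed LUT."""
    return float(TAU_LUT[np.searchsorted(RISK_BINS, R)])
```

The key design choice is monotonicity: τ(R) decreases as R grows, so a diffuse-attention, high-perplexity prompt automatically keeps more of its cache.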
Utility estimation itself is made robust. The authors first compute a global mean attention Āⱼ,ₜ by averaging raw attention weights across all layers and heads, thereby smoothing out head‑specific spikes. They also normalize each token’s value‑vector L2 norm by the sequence‑wide average, yielding a relative value magnitude ρₗ,ₕ(t) that removes scale bias across layers. The base utility for token t in head h of layer l is then uₗ,ₕ(t) = αₗ,ₕ(t)·ρₗ,ₕ(t), where αₗ,ₕ(t) aggregates attention over a sliding window at the end of the prompt (window‑cumulative attention). The final selection score is sₗ,ₕ(t) = wₗ,ₕ·uₗ,ₕ(t). Tokens whose scores exceed the risk‑adapted threshold τ(R) are retained; the rest are evicted, producing a compressed KV cache that respects the per‑layer token budget.
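Putting the pieces together for a single head, the selection step can be sketched as follows; the array shapes, the scalar toy weight w, and the fixed τ are assumptions for illustration:

```python
import numpy as np

def select_tokens(attn: np.ndarray, values: np.ndarray,
                  w: float, tau: float, window: int = 4) -> np.ndarray:
    """Sketch of token selection for one head of one layer.

    attn:   (seq_len, seq_len) attention matrix for this head
    values: (seq_len, d) value vectors for this head
    Returns indices of retained tokens.
    """
    # alpha_t: window-cumulative attention received from the last
    # `window` query positions at the end of the prompt
    alpha = attn[-window:].sum(axis=0)
    # rho_t: value-vector L2 norm relative to the sequence-wide average
    vnorm = np.linalg.norm(values, axis=1)
    rho = vnorm / vnorm.mean()
    # s_{l,h}(t) = w_{l,h} * u_{l,h}(t) = w * alpha_t * rho_t
    s = w * alpha * rho
    return np.flatnonzero(s >= tau)  # keep tokens above the risk-adapted tau
```

Eviction then simply gathers the KV entries at the returned indices, yielding a compressed cache that fits the per-layer budget.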
Empirical evaluation is conducted on the LongBench suite (16 datasets) using four LLM backbones (LLaMA‑7B, LLaMA‑13B, Falcon‑7B, InternLM‑7B). Under a stringent 512‑token budget (≈5 % of full KV memory), CompilerKV recovers 97.7 % of the full‑KV performance. It outperforms the strongest prior methods (SnapKV, PyramidKV, DynamicKV, etc.) by an average of 5.2 points in ROUGE‑L / Exact‑Match and up to 7.8 points on the most challenging summarization tasks. Ablation studies reveal that removing the head‑weighting or the risk‑gating each degrades performance by roughly 3 %, while removing both leads to a 7 % drop, confirming their complementary nature. The offline compilation requires about 50 K prompts (≈100 M tokens) and 12 hours on an 8‑GPU A100 cluster; at inference time, both tables are accessed in O(1) time, adding less than 0.2 % latency overhead.
Limitations include the focus on prefill‑only scenarios (no dynamic re‑compression during decoding) and the need to retrain the HHT if the model’s attention heads change after fine‑tuning. Future work could extend the risk model with richer semantic signals (POS tags, named entities) and explore hybrid online‑offline schemes.
In summary, CompilerKV introduces a novel, risk‑adaptive, head‑aware KV compression paradigm that compiles offline experience into static lookup tables, enabling near‑full performance under extreme memory constraints with negligible runtime cost. This advance paves the way for deploying long‑context LLMs on commodity hardware, broadening their applicability in real‑world services that demand both depth of context and efficiency.