Draft-based Approximate Inference for LLMs
Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.
💡 Research Summary
The paper addresses the growing need to run large language models (LLMs) with very long contexts, a scenario that is increasingly common in dialogue systems, document summarisation, code completion and other applications. Standard Transformers suffer from quadratic attention cost and linear growth of the key‑value (KV) cache, which quickly exhausts GPU memory and inflates latency when the context length reaches tens of thousands of tokens. Existing approximate inference techniques—KV‑cache dropping, sparse attention, prompt compression—attempt to reduce these costs by estimating the importance of each input token or KV pair. However, they rely solely on the current attention patterns of the target model, which are only a coarse proxy for future relevance because future output tokens are not yet available.
The authors propose a unifying framework called Draft‑based Approximate Inference. The central idea is to run a small draft model (a lightweight version of the target model that shares the same tokenizer and architecture) in parallel with the target model. The draft model quickly generates “look‑ahead” tokens for the given prompt. By feeding these draft outputs back into the attention computation, the framework can estimate how much each input token or KV pair will be attended to by future queries. This look‑ahead information yields far more accurate importance scores than methods that only see the current state.
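The core idea can be illustrated with a minimal sketch: treat the draft model's lookahead tokens as queries, and score each prompt position by the average attention it receives from those future queries. The function name and the simple mean aggregation here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def lookahead_importance(attn: np.ndarray, n_prompt: int) -> np.ndarray:
    """Estimate per-position importance of prompt tokens / KV pairs.

    attn: (n_lookahead_queries, n_keys) attention weights, where rows are
    the draft model's lookahead tokens and columns are key positions.
    Returns the mean attention each of the first n_prompt positions
    receives from the lookahead queries (illustrative aggregation).
    """
    return attn[:, :n_prompt].mean(axis=0)

# Toy example: 2 lookahead queries attending over 4 prompt positions.
attn = np.array([[0.1, 0.6, 0.2, 0.1],
                 [0.3, 0.4, 0.2, 0.1]])
scores = lookahead_importance(attn, n_prompt=4)
# scores -> [0.2, 0.5, 0.2, 0.1]: position 1 is most likely to matter
# for future outputs, so it should be kept under a tight budget.
```

Methods that lack lookahead must instead score positions using only the attention observed so far, which is exactly the coarse proxy the paper argues against.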
Two concrete algorithms are built on this framework:
- SpecKV (Speculative KV Dropping) – During the sparse pre‑fill stage, the draft model's look‑ahead tokens are used to compute an importance vector s for every KV pair (average attention from all future queries). The algorithm then drops KV pairs whose importance falls below a threshold, respecting a pre‑specified cache budget Cₘₐₓ. Theoretical analysis (Theorem 1) shows that if the draft model's embedding error ε is bounded, the resulting importance error is O(ε·√d), guaranteeing that a sufficiently accurate draft model yields reliable KV‑dropping decisions.
- SpecPC (Speculative Prompt Compression) – The draft model's attention scores are aggregated to produce per‑token importance values. Tokens with low scores are removed from the prompt before it reaches the target model, reducing both attention and MLP work for the entire generation.
A third method, SpecKV‑PC, cascades SpecPC followed by SpecKV, achieving the strongest memory reduction while preserving accuracy.
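The budgeted dropping step shared by both algorithms reduces to a top-k selection over the importance scores. The sketch below keeps the Cₘₐₓ highest-scoring KV pairs while preserving their original positional order; the function name and placeholder key/value arrays are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def drop_kv(keys: np.ndarray, values: np.ndarray,
            scores: np.ndarray, c_max: int):
    """SpecKV-style budgeted dropping: retain the c_max KV pairs with the
    highest importance scores, sorted back into positional order so
    relative token positions are preserved."""
    keep = np.sort(np.argsort(scores)[-c_max:])
    return keys[keep], values[keep], keep

# Toy example: 5 KV pairs (1-d placeholder vectors), budget of 3.
keys = np.arange(5, dtype=float)[:, None]
values = np.arange(5, dtype=float)[:, None]
scores = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
k, v, kept = drop_kv(keys, values, scores, c_max=3)
# kept -> [1, 3, 4]: the lowest-scoring positions 0 and 2 are evicted.
```

SpecPC applies the same selection to prompt token indices before pre-fill, and SpecKV‑PC simply chains the two selections, which is why the cascade yields the largest memory reduction.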
The paper provides a detailed complexity table. Compared with prior work (SnapKV, PyramidKV, H2O, LA Q++, etc.), SpecKV retains the same asymptotic time for pre‑fill (O(n_in·s_prefill)) and decoding (O(n_out·(Cₘₐₓ + n_out))) while offering a tighter memory bound because it actively discards low‑importance KV pairs. SpecPC adds only O(Cₘₐₓ²) memory for the compression step, which is negligible relative to the full KV cache.
Empirical evaluation is conducted on long‑context benchmarks (RULER‑32K) using Llama‑3‑70B, Qwen2.5‑1B/14B and other models. Key findings include:
- Lower embedding error ε (achieved by larger draft models or larger initial cache) correlates with higher downstream RULER scores. SpecKV consistently outperforms LA Q++ by achieving smaller ε for the same budget.
- Importance scores derived from the draft model correlate strongly with those of the target model (R² ranging from 0.85 to 0.97), confirming the theoretical claim that the draft model provides a reliable proxy.
- SpecPC’s performance improves as the draft model size grows, reflecting better token‑importance estimation.
- Across all settings, the proposed methods match or slightly improve latency, throughput, and memory usage relative to baselines, while delivering 2–4 percentage‑point gains in accuracy under identical cache or prompt size constraints.
Critical discussion: The approach introduces an auxiliary draft model, which incurs extra compute and memory. The paper argues the draft model is lightweight, but does not quantify the net system‑level cost when both models run concurrently on a single GPU or in a distributed setting. Moreover, experiments focus on English summarisation and QA tasks; the generality to multilingual, code, or multimodal contexts remains untested. Interaction effects between KV dropping and prompt compression (e.g., whether removed KV pairs overlap with compressed tokens) are not deeply analysed. Finally, the draft model is treated as a fixed black box; exploring training strategies (e.g., fine‑tuning the draft for better importance prediction) could further boost performance.
In conclusion, the work demonstrates that look‑ahead via a small draft model is a powerful tool for approximate inference. By converting future‑token information into accurate importance estimates, SpecKV and SpecPC achieve state‑of‑the‑art trade‑offs between memory, latency, and accuracy for long‑context LLM inference. The framework opens a promising direction for deploying ever‑larger models in resource‑constrained environments while maintaining high-quality outputs.