SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32×–1.5× decoding speedup without altering the target model’s output distribution, supporting low-latency multimedia applications.
💡 Research Summary
The paper introduces SDFP (Speculative Decoding with FIT‑Pruned models), a training‑free, plug‑and‑play framework that creates a lightweight draft model for speculative decoding by pruning layers of a pretrained large language model (LLM) based on Fisher Information Trace (FIT) scores. Speculative decoding (SD) accelerates autoregressive generation by having a fast draft model propose a block of k tokens, which the high‑quality target model then verifies using an accept‑reject rule that preserves the exact output distribution. Existing approaches to obtain an effective draft model typically require additional fine‑tuning, meta‑learning, or costly hyper‑parameter searches, which hampers deployment.
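The propose-then-verify loop described above can be sketched at a high level as follows. This is an illustrative skeleton, not the paper's implementation: `draft_step` and `verify_block` are hypothetical stand-ins for the pruned draft's block generation and the target's verification, and the only structural property relied on is that each round emits at least one token (guaranteed by the target's correction sample on rejection):

```python
def speculative_decode(draft_step, verify_block, prompt, k, max_new):
    """High-level speculative decoding loop (sketch).

    draft_step(context, k)   -> k speculative tokens from the draft model
    verify_block(ctx, block) -> the accepted prefix (>= 1 token per round,
                                since a rejected position is replaced by a
                                token sampled from the target model)
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        proposal = draft_step(out, k)           # draft proposes a block
        accepted = verify_block(out, proposal)  # target keeps a prefix
        out.extend(accepted)
    return out[: len(prompt) + max_new]
```

With stub models that always propose token `1` and accept everything, the loop simply fills the budget: `speculative_decode(lambda c, k: [1] * k, lambda c, p: p, [0], 2, 5)` yields `[0, 1, 1, 1, 1, 1]`.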
SDFP eliminates this overhead by estimating layer‑wise sensitivity directly from the model’s Fisher information. For each transformer layer ℓ, the empirical Fisher information is approximated from the sum of squared gradients over a small calibration set (e.g., WikiText2). The trace of this matrix, Tr(I(θ_ℓ)), serves as the FIT score: lower scores indicate that perturbations (such as pruning) have little impact on the model’s output distribution. After computing FIT for all layers, SDFP sorts them and removes the lowest‑scoring layers according to a user‑specified pruning ratio r (typically 10–30%). This pruning requires only a single forward‑backward pass per minibatch, no second‑order derivatives, and no task‑specific data, making it fast and broadly applicable.
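As a toy illustration of the scoring-and-selection step (not the paper's implementation; in practice the squared gradients would be accumulated per layer over calibration minibatches of a real model), the logic reduces to:

```python
import math

def fit_scores(layer_grads):
    # FIT proxy: trace of the empirical Fisher, approximated as the sum of
    # squared first-order gradients over each layer's parameters.
    return [sum(g * g for g in grads) for grads in layer_grads]

def layers_to_keep(scores, ratio):
    # Drop the floor(ratio * L) layers with the lowest FIT scores;
    # the surviving layers form the draft model.
    num = len(scores)
    k = math.floor(ratio * num)
    dropped = set(sorted(range(num), key=scores.__getitem__)[:k])
    return [i for i in range(num) if i not in dropped]
```

For example, with four layers and r = 0.25, the single layer whose accumulated squared gradients are smallest is removed, and the remaining three layers are kept in their original order.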
The resulting pruned network becomes the draft model f̂. During decoding, f̂ generates k speculative tokens in parallel; the target model f_θ evaluates the same block using cached KV states and computes acceptance probabilities α_i = min(1, p_θ(x_i)/q_φ(x_i)). Accepted tokens are emitted immediately; upon the first rejection, the target model samples a corrected token, ensuring that the final output distribution matches that of the full model. Because the draft model retains only the most informative layers, it remains fast while still producing predictions that are often accepted, yielding substantial speedups without sacrificing quality.
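A minimal sketch of this accept-reject step, assuming per-position draft probabilities q and target probabilities p are available as plain lists over the vocabulary (names and data layout here are illustrative, not from the paper):

```python
import random

def verify_block(draft_tokens, p_target, q_draft, rng):
    """Accept draft token i with probability min(1, p/q); on the first
    rejection, resample from the residual distribution max(p - q, 0),
    renormalized, which preserves the target's output distribution.
    (A full implementation also samples one bonus token from the target
    when every draft token is accepted.)"""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i][tok], q_draft[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)  # token kept as-is
        else:
            residual = [max(pv - qv, 0.0)
                        for pv, qv in zip(p_target[i], q_draft[i])]
            z = sum(residual)
            weights = [r / z for r in residual]
            accepted.append(rng.choices(range(len(weights)),
                                        weights=weights)[0])
            break  # discard the rest of the draft block
    return accepted
```

A useful sanity check: when the draft and target distributions coincide at every position, p/q = 1 and every draft token is accepted.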
Experiments cover LLaMA‑2‑13B, its chat variant, and LLaMA‑2‑70B, comparing SDFP against prior plug‑and‑play methods such as Parallel, Lookahead, and SWIFT. Across benchmarks (CNN/DM summarization, GSM8K math, TinyStories generation), SDFP achieves 1.32×–1.5× the token‑per‑second throughput of standard autoregressive decoding (≈28 tokens/s for the 13B models) while maintaining acceptance rates comparable to the best existing methods. Notably, SDFP requires zero additional training, no Bayesian optimization, and negligible offline overhead (pruning completes within minutes).
The paper’s contributions are threefold: (1) introducing FIT as a unified sensitivity metric that captures both weight and activation perturbations for draft model construction; (2) presenting a fully training‑free pruning pipeline that can be applied to any pretrained LLM without task‑specific tuning; and (3) demonstrating competitive or superior acceleration across multiple model sizes and tasks.
Limitations include the fact that layer pruning alone does not dramatically reduce parameter count, so memory savings are modest, and FIT scores depend on the calibration dataset, which may affect robustness for highly specialized domains. Future work could explore finer‑grained pruning (channel or neuron level), dynamic adjustment of pruning ratios based on real‑time acceptance statistics, or combining FIT‑guided pruning with quantization or knowledge distillation.
In summary, SDFP offers a practical, theoretically grounded solution to the primary barrier of deploying speculative decoding: obtaining an effective draft model. By turning pruning into draft‑model generation, it makes speculative decoding instantly applicable to a wide range of LLMs, opening the door for low‑latency, high‑throughput language‑model services in interactive multimedia, recommendation, and real‑time AI assistants.