PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Speculative decoding (SD) is a powerful technique for accelerating the inference of large language models (LLMs) without sacrificing accuracy. Typically, SD employs a small draft model to generate a fixed number of draft tokens, which are then verified in parallel by the target model. However, our experiments reveal that the optimal draft length varies significantly across decoding steps, which suggests that a fixed draft length limits the potential for further improvements in decoding speed. To address this challenge, we propose PACER, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. This layer pre-verifies draft tokens blockwise before they are sent to the target model, allowing the draft model to stop generating tokens as soon as blockwise pre-verification fails. We implement PACER on multiple SD model pairs and evaluate its performance across various benchmarks. Our results demonstrate that PACER achieves up to 2.66× speedup over autoregressive decoding and consistently outperforms standard speculative decoding. Furthermore, when integrated with Ouroboros, PACER attains up to 3.09× speedup.


💡 Research Summary

Speculative decoding (SD) has emerged as a powerful technique to accelerate inference of large language models (LLMs) by allowing a small “draft” model to generate a batch of candidate tokens that are later verified in parallel by the much larger target model. While this approach can dramatically reduce the number of sequential forward passes, existing SD methods rely on a fixed draft window size γ (the number of tokens generated before verification). The authors of this paper conduct a systematic empirical study that reveals a critical limitation: the optimal acceptance length L_A—the number of draft tokens that will actually be accepted by the target model—varies dramatically from step to step during generation. When γ is set too small, the target model must be invoked frequently, negating most of the speed gains. When γ is set too large, many draft tokens are wasted because they are later rejected. Their analysis shows that dynamically matching γ to the per‑step optimal length could improve throughput by up to 1.4× compared with the best fixed‑γ baseline.
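The fixed-γ trade-off can be illustrated with a toy cost model. The sketch below is not from the paper: the per-step acceptance lengths and the relative draft/target forward-pass costs (`C_D`, `C_T`) are hypothetical numbers chosen only to show why an oracle that matches γ to the per-step acceptance length L_A beats any single fixed γ.

```python
import random

random.seed(0)
C_D, C_T = 1.0, 10.0  # assumed relative costs of one draft / one target forward pass

# Hypothetical per-step acceptance lengths L_A (illustrative, not from the paper).
accept_lens = [random.choice([0, 1, 2, 4, 8, 12]) for _ in range(10_000)]

def throughput(gammas):
    """Tokens emitted per unit compute when step i drafts gammas[i] tokens."""
    tokens = cost = 0.0
    for la, g in zip(accept_lens, gammas):
        tokens += min(g, la) + 1   # accepted draft tokens + 1 token from verification
        cost += g * C_D + C_T      # drafting cost + one target verification pass
    return tokens / cost

# Best single fixed gamma vs. an oracle that sets gamma = L_A at every step.
fixed = max(throughput([g] * len(accept_lens)) for g in range(1, 16))
oracle = throughput(accept_lens)
print(f"best fixed-gamma throughput: {fixed:.3f}, oracle adaptive: {oracle:.3f}")
```

Under these assumed costs the oracle clearly outperforms every fixed γ, because a small γ pays the target-verification cost too often while a large γ burns draft compute on tokens that get rejected.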

To address this, the authors propose PACER (Blockwise Pre‑verification for Speculative Decoding with Adaptive Length). PACER introduces a lightweight, trainable pre‑verification module (denoted M_B) that sits between the draft model (M_D) and the target model (M_T). The draft model generates tokens in blocks of size b (e.g., b = 3). After each block, M_B consumes the hidden states of the draft tokens together with positional embeddings and predicts an acceptance probability α̂ for each token. The mean acceptance probability across the block is compared against a threshold t. If the mean falls below t, drafting stops and the accumulated block is sent to the target model for full verification; otherwise, the draft model proceeds to generate another block. The threshold t is increased by a factor ρ > 1 after each successful block, making it progressively easier to stop drafting as generation proceeds. This blockwise strategy amortizes the overhead of the pre‑verification step while still providing fine‑grained control over the draft length.
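The drafting loop described above can be sketched as follows. This is a schematic reading of the mechanism, not the authors' implementation: `draft_block` (standing in for M_D) and `pre_verify` (standing in for M_B) are hypothetical stubs, and the default values for b, t, and ρ are illustrative.

```python
def adaptive_draft(draft_block, pre_verify, b=3, t0=0.5, rho=1.2, max_blocks=8):
    """One PACER-style drafting phase (sketch, assumed interfaces).

    draft_block(n) -> (tokens, hidden_states)  # stub for the draft model M_D
    pre_verify(hidden_states) -> [float, ...]  # stub for the module M_B:
                                               # per-token acceptance probabilities
    Returns the accumulated draft tokens to send to the target model M_T.
    """
    tokens, t = [], t0
    for _ in range(max_blocks):
        block, hidden = draft_block(b)     # M_D generates one block of b tokens
        tokens.extend(block)
        alpha = pre_verify(hidden)         # predicted acceptance probability per token
        if sum(alpha) / len(alpha) < t:    # mean below threshold: stop drafting
            break
        t *= rho                           # raise t, so stopping gets easier later
    return tokens

# Usage with constant-probability stubs: high confidence keeps drafting for
# several blocks; low confidence stops after the first block.
long_draft = adaptive_draft(lambda n: (list(range(n)), [0] * n),
                            lambda h: [0.9] * len(h))
short_draft = adaptive_draft(lambda n: (list(range(n)), [0] * n),
                             lambda h: [0.1] * len(h))
```

With the defaults above, a constant predicted probability of 0.9 survives the thresholds 0.5, 0.6, 0.72, 0.864 and fails at 1.037, so drafting stops after five blocks; a constant 0.1 fails the very first check.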

Training of M_B is performed offline using data generated by the target model. For each training example, the target model produces a reference output; the draft model then generates a long sequence of candidate tokens (e.g., γ = 50). Tokens that match the reference are labeled as accepted (1), and the first mismatching token and all subsequent tokens are labeled as rejected (0). The pre‑verification module is trained with a standard cross‑entropy loss to predict these binary labels. To improve efficiency, the authors pack multiple decoding steps into a single training sequence and employ custom attention masks that respect the blockwise structure.
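The labeling rule is simple enough to state in a few lines. The helper below is a hypothetical illustration of it (the function name and token representation are mine, not the paper's): tokens matching the reference are labeled 1, and the first mismatch and everything after it are labeled 0, mirroring how the target model's verification would cut the draft.

```python
def acceptance_labels(draft_tokens, reference_tokens):
    """Binary training labels for the pre-verification module (sketch).

    1 while the draft matches the target model's reference output,
    0 from the first mismatching position onward (and past the reference end).
    """
    labels, accepted = [], True
    for i, tok in enumerate(draft_tokens):
        if accepted and (i >= len(reference_tokens) or tok != reference_tokens[i]):
            accepted = False               # first mismatch: reject the rest
        labels.append(1 if accepted else 0)
    return labels

# Draft diverges from the reference at position 2, so positions 2 and 3 are rejected.
print(acceptance_labels([5, 7, 9, 2], [5, 7, 1, 2]))
```

M_B is then trained with cross-entropy against these labels, so its predicted α̂ approximates the probability that verification would accept each token.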

The authors evaluate PACER on several model pairs, including DeepSeek‑Coder (1.3 B → 33 B), Llama‑2 (7 B → 70 B), and Qwen‑2.5 (7 B → 72 B). Benchmarks span code generation (HumanEval), mathematical reasoning (MATH), and text summarization. Across all settings, PACER consistently outperforms fixed‑γ speculative decoding, achieving up to 2.66× speedup over vanilla autoregressive decoding. When combined with the Ouroboros framework—a recent method that improves draft quality—PACER reaches a peak speedup of 3.09×. The authors also provide an analysis of acceptance rates as a function of token position, confirming that later tokens in a draft block are less likely to be accepted, which justifies the inclusion of positional embeddings in M_B.

Key contributions of the paper are: (1) a thorough empirical demonstration that adaptive draft lengths are essential for efficient speculative decoding; (2) the design of a blockwise pre‑verification module that dynamically determines the optimal draft window size with minimal overhead; (3) a training pipeline that aligns pre‑verification predictions with the actual verification process; and (4) extensive experiments showing that PACER is compatible with existing speculative decoding enhancements and can be seamlessly integrated into diverse LLM pipelines.

In summary, PACER offers a simple yet effective mechanism—blockwise pre‑verification plus adaptive thresholding—to close the gap between the theoretical speedup potential of speculative decoding and its practical realization. By intelligently stopping draft generation before costly target verification, it reduces both wasted draft computation and unnecessary target forward passes, delivering substantial real‑world inference acceleration for large language models.

