Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Multi-path speculative decoding accelerates lossless sampling from a target model by using a cheaper draft model to generate a draft tree of tokens, and then applying a verification algorithm that accepts a subset of them. While prior work has proposed various verification algorithms for i.i.d. rollouts, their relative performance under matched settings remains unclear. In this work, we first present a systematic evaluation of verification strategies across model families, tasks, and sampling regimes, and find that Traversal Verification consistently dominates, with optimal-transport-based (OT-based) methods lagging far behind. Our analysis shows that this occurs because OT-based methods achieve high multi-token acceptance near the root of the draft tree, whereas multi-token gains are most impactful deeper in the draft tree, where the draft and target distributions diverge. Based on this insight, we propose delayed tree expansion, which drafts a partial single path, delaying the i.i.d. branching point. We show that delayed tree expansion preserves the target distribution and improves on root-node i.i.d. rollouts. Further, we develop a dynamic neural selector that estimates the expected block efficiency of OT-based verification methods from draft and target features, enabling context-dependent expansion decisions. Our neural selector allows OT-based methods like SpecInfer to outperform Traversal Verification for the first time, achieving 5% higher average throughput across a wide range of models, datasets, and sampling settings.
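To make the shape of a delayed draft tree concrete, the following minimal Python sketch builds one under stated assumptions. It covers only the drafting step described above (a shared single-path trunk followed by i.i.d. rollouts), not verification or the neural selector. The names `sample_token` and `draft_delayed_tree`, the uniform toy sampler standing in for the draft model, and the choice to hold the total path depth fixed are all illustrative assumptions rather than the paper's implementation.

```python
import random


def sample_token(context, vocab_size, rng):
    """Hypothetical stand-in for a draft model: samples uniformly at random.

    A real draft model would condition on `context`; this toy version ignores it.
    """
    return rng.randrange(vocab_size)


def draft_delayed_tree(prefix, delay, num_paths, depth, vocab_size=50, seed=0):
    """Draft a shared trunk of `delay` tokens, then `num_paths` i.i.d. rollouts.

    Every returned path has `depth` tokens and starts with the same trunk.
    Setting delay=0 recovers ordinary i.i.d. rollouts that branch at the root.
    """
    if not 0 <= delay <= depth:
        raise ValueError("delay must satisfy 0 <= delay <= depth")
    rng = random.Random(seed)

    # Single-path phase: one sample per position, so no branching yet.
    trunk = []
    for _ in range(delay):
        trunk.append(sample_token(prefix + trunk, vocab_size, rng))

    # Branching phase: each rollout independently extends a copy of the trunk.
    paths = []
    for _ in range(num_paths):
        path = list(trunk)
        while len(path) < depth:
            path.append(sample_token(prefix + path, vocab_size, rng))
        paths.append(path)
    return trunk, paths


trunk, paths = draft_delayed_tree(prefix=[1, 2, 3], delay=2, num_paths=4, depth=5)
print("trunk:", trunk)
for path in paths:
    print("path: ", path)
```

In this sketch the branching point moves from the root to position `delay`, so the first `delay` drafted tokens are shared by every path and the rollouts differ only from that position onward.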


💡 Research Summary

The paper tackles two intertwined challenges in multi‑path speculative decoding for large language models (LLMs): (1) determining which verification algorithm yields the best speed‑up under comparable drafting conditions, and (2) designing a drafting policy that maximizes the benefit of the chosen verifier.
First, the authors conduct a large-scale, controlled benchmark of all existing multi-path verification methods (including naive acceptance, NSS, SpecT, SpecInfer, and Traversal Verification) across several model families (Gemma, Qwen, Llama), tasks (translation, coding, math reasoning), and sampling regimes (different temperatures and nucleus parameters). They keep the draft generation identical (i.i.d. rollouts) for every method, thereby isolating the verifier’s contribution. The primary metric is block efficiency E

