Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

Reading time: 5 minutes

📝 Original Info

  • Title: Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs
  • ArXiv ID: 2512.20573
  • Date: 2025-12-23
  • Authors: Rui Pan¹, Zhuofu Chen¹, Hongyi Liu², Arvind Krishnamurthy³⁴, Ravi Netravali¹ (¹Princeton University, ²Rice University, ³Google, ⁴University of Washington)

📝 Abstract

Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9× speedup over vanilla decoding, 1.7× over the best naive dLLM drafter, and 1.7× over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
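The "lossless" claim follows from standard speculative decoding: the AR target scores the entire draft in a single forward pass and keeps only the prefix it agrees with, so the final output matches what the target would have produced on its own. Below is a minimal sketch of the greedy-verification case (sampled decoding uses rejection sampling instead); the `target_argmax` callable and its interface are illustrative assumptions, not FailFast's API.

```python
from typing import Callable, List, Tuple

def verify_greedy(
    target_argmax: Callable[[List[int]], List[int]],
    prefix: List[int],
    draft: List[int],
) -> Tuple[List[int], int]:
    """Greedy speculative-decoding verification (illustrative sketch).

    target_argmax(ids) returns, for every position, the target model's
    greedy next-token prediction given all preceding ids (one batched
    forward pass in a real system). The draft is kept up to the first
    disagreement and the target's own token is appended, so the output
    is identical to vanilla greedy decoding by the target alone.
    """
    preds = target_argmax(prefix + draft)     # len(prefix) + len(draft) predictions
    next_preds = preds[len(prefix) - 1:]      # predictions for each draft slot + 1 bonus

    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if next_preds[i] == tok:
            accepted.append(tok)              # target agrees: keep the drafted token
        else:
            return accepted, next_preds[i]    # first mismatch: stop, take target's token
    return accepted, next_preds[len(draft)]   # full draft accepted, plus one free token
```

The longer the accepted prefix, the more target forward passes a single verification call replaces, which is why long, mostly-correct drafts drive the reported speedups.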

💡 Deep Analysis

Figure 1

📄 Full Content

1. Introduction

A new wave of Diffusion Large Language Models (dLLMs) (Khanna et al., 2025; Song et al., 2025; gem, 2025; Bie et al., 2025; Wu et al., 2025b) has emerged as a compelling alternative to the standard autoregressive paradigm in large language models. Unlike autoregressive (AR) LLMs, which are constrained to generating tokens one by one from left to right, dLLMs possess the unique capability to predict and unmask multiple tokens at arbitrary positions simultaneously. Crucially, this decoding process is highly customizable: the model's unmasking strategy determines exactly which and how many tokens are unmasked during each denoising step (a model forward pass). As such, dLLMs are highly attractive for low-latency inference.

Yet despite their speed, parallel generation imposes a fundamental limit on modeling accuracy. This limitation stems from the conditional independence assumption required for simultaneous sampling of multiple tokens; by treating tokens generated within the same step as independent of one another, the decoding process inevitably ignores crucial mutual dependencies (Wu et al., 2025b; Kang et al., 2025). Consequently, a direct tension emerges between efficiency and quality. Improving the generation speed (i.e., using fewer forward passes) necessitates unmasking a larger number of tokens per step, which exacerbates the risk of quality degradation. Conversely, maximizing quality forces the sampling procedure to adopt a strict left-to-right, one-token-per-step order that essentially falls back to the speed of autoregressive generation.
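The unmasking strategy mentioned above is the concrete knob that trades forward passes against quality. The sketch below shows one common rule, confidence-threshold unmasking (in the spirit of Fast-dLLM-style decoding); the `MASK_ID` constant, array shapes, and function name are illustrative assumptions rather than any specific model's API.

```python
import numpy as np

MASK_ID = -1  # illustrative mask-token id (assumption, not a real vocabulary entry)

def unmask_step(logits: np.ndarray, tokens: np.ndarray, threshold: float) -> np.ndarray:
    """One denoising step: fill masked positions whose top-1 confidence
    clears `threshold`, always filling at least one position so the step
    makes progress.

    logits: (seq_len, vocab_size) scores from the dLLM for the current canvas.
    tokens: (seq_len,) current canvas, MASK_ID at still-masked positions.
    """
    # Softmax over the vocabulary to get per-position confidences.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    conf = probs.max(axis=-1)       # top-1 confidence per position
    top = probs.argmax(axis=-1)     # greedy token per position

    masked = tokens == MASK_ID
    fill = masked & (conf >= threshold)
    if masked.any() and not fill.any():
        # Guarantee progress: unmask the single most confident masked position.
        fill[np.where(masked, conf, -np.inf).argmax()] = True

    out = tokens.copy()
    out[fill] = top[fill]
    return out
```

Lowering the threshold unmasks more positions per call, which is exactly the fewer-passes, lower-quality end of the tradeoff; raising it recovers near-autoregressive quality at near-autoregressive speed.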
While existing work strives to alleviate the stark compute-accuracy tradeoff of dLLMs as standalone generators (Kang et al., 2025; Bie et al., 2025; Qian et al., 2026), this work instead focuses on motivating and realizing a scenario in which we argue that dLLMs are intrinsically beneficial: as draft models in speculative decoding (Leviathan et al., 2023) with autoregressive target models. Our proposal extends beyond a simple drop-in replacement of dLLMs as drafters in existing speculative decoding strategies to reap their latency benefits – indeed, we later show how this can forgo substantial benefits they bring. Instead, our approach is rooted in two key observations that challenge the status quo for both speculative decoding design and considerations around the limitations of dLLMs.

First, dLLMs can generate long drafts quickly. In AR LLM inference, the decoding latency scales with the number of output tokens (i.e., the number of model forward passes) (Yan et al., 2024; Agrawal et al., 2024), whereas dLLMs can unmask multiple tokens in each forward pass, so the latency is instead linear in the number of model forward passes.

[Figure 1: Drafting strategies compared on a draft. An autoregressive drafter needs 10 forward passes for 10 tokens; a Fast-dLLM drafter needs 7 forward passes for 10 tokens; FailFast spends 1 forward pass for 10 tokens in the hardest-to-speculate regions ("fail fast") and 6 forward passes for 60 tokens in the easiest-to-speculate regions ("win big").]
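The "fail fast / win big" behavior can be pictured as a drafting loop whose speculation length shrinks after early rejections and grows after full acceptances. The sketch below only illustrates that control idea and is not FailFast's actual controller; `draft_fn`, `verify_fn`, and the growth/shrink parameters are assumptions introduced for the example.

```python
def speculate_adaptive(draft_fn, verify_fn, prompt, max_new_tokens,
                       min_len=4, max_len=96, grow=2.0, shrink=0.5):
    """Toy adaptive-speculation loop (illustration only, not FailFast's controller).

    draft_fn(prefix, k)      -> list of k drafted token ids (e.g., a dLLM unmasking k slots)
    verify_fn(prefix, draft) -> (accepted_tokens, correction_token) from the AR target
    """
    out = list(prompt)
    spec_len = min_len
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_fn(out, spec_len)
        accepted, correction = verify_fn(out, draft)
        out += accepted + [correction]

        if len(accepted) == len(draft):
            # "Win big": the region is easy, so extend drafts to amortize verification.
            spec_len = min(int(spec_len * grow), max_len)
        else:
            # "Fail fast": rejection happened early, spend less on the next draft.
            spec_len = max(int(spec_len * shrink), min_len)
    return out[:len(prompt) + max_new_tokens]
```

Because a dLLM draft costs roughly one forward pass regardless of its length, failed long drafts are cheap, which is what makes the aggressive "win big" extensions (the 70-token accepted drafts reported in the abstract) practical.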

📸 Image Gallery

accepted.png ar.png cdf_len_distribution.png concavity.png difficulty_intuition.png eagle3_14b.png eagle3_32b.png eagle3_7b.png intuition.png latency_breakdown_models.png micro_maxlen.png micro_threshold.png

Reference

This content is AI-processed based on open access ArXiv data.
