P-EAGLE: Parallel-Drafting EAGLE with Scalable Training
Reasoning LLMs produce longer outputs, requiring speculative decoding drafters trained on extended sequences. Parallel drafting - predicting multiple tokens per forward pass - offers latency benefits over sequential generation, but training complexity scales quadratically with the product of sequence length and parallel positions, rendering long-context training impractical. We present P(arallel)-EAGLE, which transforms EAGLE from autoregressive to parallel multi-token prediction via a learnable shared hidden state. To scale training to long contexts, we develop a framework featuring attention mask pre-computation and sequence partitioning techniques, enabling gradient accumulation within individual sequences for parallel-prediction training. We implement P-EAGLE in vLLM and demonstrate speedups of 1.10-1.36x over autoregressive EAGLE-3 across GPT-OSS 120B, 20B, and Qwen3-Coder 30B.
💡 Research Summary
The paper introduces P‑EAGLE, a parallel‑drafting extension of the EAGLE lightweight draft model designed to address the latency bottleneck of autoregressive decoding in large language models (LLMs). Autoregressive decoding requires a full forward pass of the entire model for each generated token, which becomes especially costly when reasoning‑capable models produce long outputs (median > 3,900 tokens, P90 > 10,800 tokens on the UltraChat benchmark). Speculative decoding mitigates this by having a draft model propose multiple candidate tokens that are verified in a single pass by the target model, but existing draft models, including the original EAGLE, still generate drafts autoregressively, incurring K sequential forward passes for K draft tokens.
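The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration of generic speculative decoding with greedy acceptance, not the paper's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the target and draft models, and the draft here is still generated autoregressively (the K sequential passes that P-EAGLE later removes).

```python
# Toy sketch of speculative decoding with greedy verification.
# `target_next` / `draft_next` are hypothetical stand-in models.

def target_next(ctx):
    # Deterministic toy "target model": next token = (sum of context) % 7.
    return sum(ctx) % 7

def draft_next(ctx):
    # Toy draft model that agrees with the target except when sum % 5 == 0.
    t = sum(ctx) % 7
    return (t + 1) % 7 if sum(ctx) % 5 == 0 else t

def speculative_step(ctx, k=4):
    """Draft k tokens autoregressively, then verify against the target.

    Returns the tokens accepted this step: the longest draft prefix that
    matches the target, plus one corrected (or bonus) target token.
    """
    draft = []
    for _ in range(k):                    # k sequential drafter passes
        draft.append(draft_next(ctx + draft))
    accepted = []
    for tok in draft:                     # one verification pass conceptually
        t = target_next(ctx + accepted)
        if tok == t:
            accepted.append(tok)          # draft token verified
        else:
            accepted.append(t)            # replace with target token, stop
            break
    else:
        accepted.append(target_next(ctx + accepted))  # bonus token
    return accepted

out = speculative_step([1, 2, 3], k=4)
```

Every accepted token is, by construction, exactly what the target model would have produced, so speculative decoding changes latency but not output quality.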
Parallel drafting promises to eliminate this sequential overhead by predicting multiple tokens in a single forward pass. Prior parallel‑drafting works (ParallelSpec, P‑ARD) either lack sufficient implementation details or suffer from quadratic memory growth due to the product of sequence length (n) and parallel prediction depth (K). Specifically, the attention matrix scales as O((n·K)²), making training on long contexts (≥ 8 k tokens) infeasible. Moreover, Conditional Drop‑token (COD) sampling used in P‑ARD reduces the number of positions per depth but still requires per‑example mask construction, which becomes a severe computational bottleneck.
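The O((n·K)²) blow-up is easy to see with back-of-the-envelope arithmetic. The byte figures below assume a dense mask with one byte per entry and are illustrative only, not measurements from the paper:

```python
# Illustration of the O((n*K)^2) attention-mask growth: a 10x longer
# sequence means 100x more mask entries. Byte counts assume a 1-byte
# boolean per entry and are illustrative, not the paper's measurements.

def mask_entries(n, k):
    """Entries in a dense attention mask over L = n*k draft positions."""
    L = n * k
    return L * L

small = mask_entries(2_000, 8)    # 2k-token sequence, depth K = 8
large = mask_entries(20_000, 8)   # 20k-token sequence, depth K = 8

assert large == 100 * small       # quadratic in sequence length
print(f"2k tokens:  {small / 2**30:.2f} GiB per example")
print(f"20k tokens: {large / 2**30:.2f} GiB per example")
```

At 20k tokens and K = 8 a single dense mask already runs to tens of gibibytes per example, which is why naïve long-context training is infeasible.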
P‑EAGLE solves these problems through two main innovations. First, it introduces a learnable shared hidden state h that is reused for all Multi‑Token Prediction (MTP) positions. In the original EAGLE, each token’s prediction depends on the hidden vector from the previous step; in a parallel setting those vectors are unavailable. By sharing a single hidden vector across all MTP positions, the model can generate K draft tokens in one pass without needing position‑specific parameters. The authors provide a theoretical argument that the attention mechanism itself encodes sufficient positional information, making separate hidden states unnecessary. Empirically, four alternative designs (depth‑specific encodings, projected NTP vectors, etc.) were compared, and the shared‑state design outperformed them by 7–15 % in acceptance length.
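The shared-state idea can be sketched as follows. In autoregressive EAGLE each drafter step consumes the hidden state of the previous step; in the parallel setting those states do not exist yet, so every MTP slot is fed one learnable vector plus a learned [MASK] token embedding. All function and variable names here are hypothetical, not the paper's API:

```python
# Minimal sketch of the shared-hidden-state design. Each drafter input is a
# (hidden_state, token_embedding) pair, mirroring how EAGLE-style drafters
# combine target hidden states with token embeddings. Names are hypothetical.

def build_drafter_inputs(prev_hidden, last_tok_emb, mask_emb, h_shared, k):
    """Assemble drafter inputs: one next-token slot + k parallel MTP slots."""
    inputs = [(prev_hidden, last_tok_emb)]   # standard next-token input
    for _ in range(k):
        # Every MTP depth reuses the same shared state; positions are
        # distinguished by the attention mechanism (RoPE), not by
        # depth-specific parameters.
        inputs.append((h_shared, mask_emb))
    return inputs

inputs = build_drafter_inputs(
    prev_hidden=[0.1, 0.2], last_tok_emb=[1.0, 0.0],
    mask_emb=[0.0, 0.0], h_shared=[0.5, 0.5], k=3,
)
assert len(inputs) == 4 and inputs[1] == inputs[3]  # shared across depths
```

Because the K MTP inputs are identical up to position, one forward pass over the batch of slots yields all K draft tokens at once.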
Second, the authors develop a scalable training framework that eliminates the O((n·K)²) mask‑construction cost and reduces memory consumption. They observe that the causal attention pattern across depths is position‑invariant: the mask for any sequence is simply the top‑left sub‑matrix of a maximum‑length mask. Consequently, they pre‑compute a single mask for the longest sequence (e.g., 20 k tokens) at initialization and obtain per‑example masks via constant‑time tensor slicing. This eliminates the 48× data‑loading slowdown and 5× epoch‑time increase observed in P‑ARD for 2 k‑token sequences.
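The amortized mask construction reduces to slicing, which the sketch below illustrates with a plain causal mask and tiny sizes. In the paper the mask covers all n·K draft positions with cross-depth structure, and in a tensor framework the slice is a zero-copy view rather than the list copy shown here:

```python
# Sketch of amortized mask construction: build one mask for the maximum
# sequence length at init, then take the top-left sub-matrix per example.

MAX_LEN = 8  # stands in for the longest training sequence (e.g. 20k tokens)

def build_causal_mask(max_len):
    """Dense lower-triangular mask: position i may attend to j <= i."""
    return [[1 if j <= i else 0 for j in range(max_len)]
            for i in range(max_len)]

FULL_MASK = build_causal_mask(MAX_LEN)   # computed once at initialization

def mask_for(seq_len):
    """Per-example mask via slicing -- no per-example construction loop."""
    return [row[:seq_len] for row in FULL_MASK[:seq_len]]

# Position-invariance: the slice equals a freshly built shorter mask.
assert mask_for(3) == build_causal_mask(3)
```

The correctness of the shortcut rests entirely on the position-invariance observation: whether position i may attend to position j depends only on (i, j), never on the sequence's total length.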
Even with pre‑computed masks, the total number of positions L = n·K can still be huge, leading to O(L²) attention memory. To address this, P‑EAGLE introduces “within‑sequence gradient accumulation.” The long sequence is partitioned into S segments; each segment is processed independently with its own forward‑backward pass, and gradients are accumulated across segments. The key challenge is preserving COD‑induced cross‑depth dependencies: a token at depth d attends to its predecessor at depth d‑1, which may belong to a different segment if partitioning is naïve. The authors propose an iterative assignment algorithm: depth 0 and depth 1 positions are assigned based on their original indices; for deeper depths, each position inherits the segment of its depth‑1 dependency. This guarantees that any token and its required predecessor reside in the same segment, maintaining the causal structure. The resulting memory footprint scales as O(L²/S²), enabling training on 20 k‑token sequences with K = 8 on a single H200 GPU.
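The segment-assignment rule can be sketched as follows. This is a hedged simplification of the paper's algorithm: the `parent` field stands in for the COD dependency graph, and under COD a deep position's predecessor may sit at a different token index, which is exactly the case a naïve index-based split would break:

```python
# Sketch of the within-sequence partitioning rule: depth-0 and depth-1
# positions are assigned to segments by original index, and every deeper
# position inherits its predecessor's segment, so a token and the
# predecessor it attends to always land in the same segment.

def assign_segments(positions, num_segments, seq_len):
    """positions: pid -> {"index": int, "depth": int, "parent": pid|None}.

    Returns {pid: segment_id}; processes shallow depths first so a
    parent's segment is known before its children are assigned.
    """
    seg_size = (seq_len + num_segments - 1) // num_segments
    segment = {}
    for pid, p in sorted(positions.items(), key=lambda kv: kv[1]["depth"]):
        if p["depth"] <= 1:
            segment[pid] = min(p["index"] // seg_size, num_segments - 1)
        else:
            segment[pid] = segment[p["parent"]]  # inherit predecessor's segment
    return segment

# Toy example: 4 tokens, 2 segments, and a depth-2 position whose depth-1
# predecessor (after COD dropping) sits at a different token index.
pos = {
    "a0": {"index": 0, "depth": 0, "parent": None},
    "a1": {"index": 0, "depth": 1, "parent": "a0"},
    "b1": {"index": 3, "depth": 1, "parent": None},
    "b2": {"index": 1, "depth": 2, "parent": "b1"},  # index-based split would
}                                                    # put b2 in segment 0
seg = assign_segments(pos, num_segments=2, seq_len=4)
assert seg["b2"] == seg["b1"]   # dependency kept within one segment
```

Each segment then runs its own forward-backward pass, and gradients are summed across segments exactly as in ordinary gradient accumulation, only here the accumulation happens within a single long sequence.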
Architecturally, P‑EAGLE follows the LLaMA‑3 design with rotary positional embeddings (RoPE). Token embeddings are unfrozen to learn a meaningful mask‑token embedding for MTP positions. Experiments show that increasing the drafter depth from 1 to 4 layers yields a 46 % improvement in acceptance length, and aligning training and inference prediction depths (K) is crucial for optimal performance.
The system is integrated into vLLM and evaluated on three large models: GPT‑OSS 120B, GPT‑OSS 20B, and Qwen3‑Coder 30B. Across these models, P‑EAGLE achieves speedups of 1.10× to 1.36× over the autoregressive EAGLE‑3 baseline while maintaining or improving acceptance length. Notably, on the UltraChat dataset, where long reasoning traces are common, P‑EAGLE mitigates the up‑to‑25 % drop in acceptance rate observed when using draft models trained on shorter contexts.
In summary, the paper makes three contributions: (1) a scalable training pipeline for parallel drafting that combines amortized mask construction and novel within‑sequence gradient accumulation; (2) a simple yet effective shared‑hidden‑state architecture for multi‑token prediction; and (3) a thorough empirical validation showing that parallel drafting can be made practical for long‑context LLM inference, delivering measurable latency reductions without sacrificing generation quality.