Cross-Attention Speculative Decoding

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), to our knowledge the first cross-attention-based Transformer decoder SD model, which achieves performance on par with leading self-attention SD models (EAGLE-v2) while eliminating the need for pooling or auxiliary components, simplifying the architecture, improving training efficiency, and maintaining stable memory usage during training-time simulation. To enable effective training of this novel architecture, we propose Two-Stage Block-Attention Training, a new method that achieves training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, offering a strong alternative architecture for speculative decoding.


💡 Research Summary

The paper revisits speculative decoding (SD), a technique that accelerates inference in large language models (LLMs) by generating multiple tokens with a lightweight draft model before verification by the full target model. While recent state‑of‑the‑art SD systems (e.g., EAGLE‑v2) achieve impressive speedups, they rely on tightly coupled self‑attention‑based Transformer decoders augmented with auxiliary pooling or fusion layers. This coupling inflates architectural complexity, hampers portability across different LLMs, and makes training and integration cumbersome.
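The draft-then-verify loop described above can be sketched in a few lines. This is a minimal greedy-decoding illustration, not the paper's implementation: `target_next` and `draft_next` are hypothetical stand-ins for full model forward passes, and a real system verifies all draft tokens in a single batched target pass rather than one at a time.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=16):
    """Greedy speculative decoding sketch.

    target_next / draft_next: callables mapping a token list to the next
    token (stand-ins for the full target LLM and the lightweight draft).
    Each iteration the draft proposes k tokens; the target accepts the
    longest prefix matching its own greedy choices, then appends one
    corrected token, so every iteration commits at least one token.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Draft phase: propose k tokens autoregressively with the cheap model.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify phase: compare against the target's greedy choices
        # (done here sequentially; batched in a real implementation).
        accepted, ctx = [], list(out)
        for t in proposal:
            expect = target_next(ctx)
            if t != expect:
                accepted.append(expect)  # replace first mismatch, stop
                break
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_next(ctx))  # bonus token on full accept
        out.extend(accepted)
    return out[:len(prompt) + max_new]
```

Because every committed token equals the target model's own greedy choice, the output is identical to plain target-only decoding; the draft only changes how many target calls are amortized per token.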

To address these limitations, the authors introduce Budget EAGLE (Beagle), the first cross‑attention‑only SD decoder. Beagle replaces the conventional multi‑layer self‑attention block with a single cross‑attention layer followed by a point‑wise MLP. In this design, the query stream originates from the low‑level token embeddings, while the key/value streams are drawn from the higher‑level hidden states of the target model (or from previously generated draft states). A diagonal mask prevents future‑token leakage, preserving the causal nature of decoding, yet the same mask can be reused during training to predict several future tokens in parallel. By eliminating the need for pooling, copying, and concatenation of high‑level states, the architecture achieves superior memory locality and a drastic reduction in parameter count.
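A single-head version of such a cross-attention draft block can be sketched with NumPy. Everything here is illustrative (the weight names, and the omission of LayerNorm and multi-head splitting, are our simplifications, not the paper's design); the point is the data flow: queries from the token embeddings, keys/values from the target model's hidden states, and a lower-triangular mask that blocks future-token leakage.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn_draft_layer(tok_emb, tgt_hidden, Wq, Wk, Wv, Wo, W1, W2):
    """One cross-attention draft block (single head, illustrative only).

    tok_emb:    (T, d) low-level token embeddings  -> query stream
    tgt_hidden: (T, d) target-model hidden states  -> key/value streams
    A lower-triangular mask keeps position i from attending to target
    states at positions > i, preserving causality while still letting
    training score several future positions in parallel.
    """
    T, d = tok_emb.shape
    q = tok_emb @ Wq
    k = tgt_hidden @ Wk
    v = tgt_hidden @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    scores = np.where(mask, -1e9, scores)             # mask out future states
    attn = softmax(scores, axis=-1) @ v
    h = tok_emb + attn @ Wo                      # residual connection
    return h + np.maximum(h @ W1, 0.0) @ W2      # point-wise MLP, residual
```

Note there is no self-attention among draft tokens at all: the only attention is from the query stream into the (fixed-size) target states, which is what removes the need for pooling or concatenation of high-level states.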

Training stability is achieved through a novel Two‑Stage Block‑Attention Training regimen. In the first stage, the model is pre‑trained to predict a fixed horizon of k future tokens simultaneously, encouraging it to capture inter‑token dependencies early on. In the second stage, the authors adopt a Training‑Time‑Testing (TTT) style simulation of the inference process, but thanks to the cross‑attention structure the KV‑cache can be efficiently reused and reset to contain only true target states at the end of each SD iteration. This keeps GPU memory usage constant, enabling the full training of a 7 B parameter model on a single 24 GiB GPU—a feat that would be prohibitive with conventional self‑attention SD models.
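The constant-memory cache discipline can be illustrated with a toy cache. The class and method names below are hypothetical, not taken from the paper; the sketch only captures the invariant: speculative entries are appended during a simulated SD iteration, and at the end of the iteration the cache is truncated back so it holds only true target states.

```python
class DraftKVCache:
    """Toy sketch of the constant-memory KV-cache discipline.

    Speculative draft keys/values are appended past a committed
    watermark; when the SD iteration ends, the cache is truncated back
    and re-extended with true target states for the accepted tokens.
    The cache therefore never grows beyond committed length plus the
    speculation depth. (Illustrative names, not the paper's API.)
    """
    def __init__(self):
        self.keys, self.values = [], []
        self.committed_len = 0  # prefix backed by true target states

    def append_draft(self, k, v):
        # Called once per speculative token within an iteration.
        self.keys.append(k)
        self.values.append(v)

    def commit(self, accepted_keys, accepted_values):
        # End of iteration: drop all speculative entries, then extend the
        # committed prefix with target-model states for accepted tokens.
        del self.keys[self.committed_len:]
        del self.values[self.committed_len:]
        self.keys.extend(accepted_keys)
        self.values.extend(accepted_values)
        self.committed_len = len(self.keys)
```

Because the truncation is a constant-time pointer reset rather than a copy of high-level states, peak memory during training-time simulation stays flat across iterations.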

Extensive experiments are conducted on multiple LLM backbones (including LLaMA‑7B, GPT‑Neo, and others) and a variety of benchmark datasets (language modeling, code generation, etc.). Beagle matches or exceeds the token‑acceptance length of EAGLE‑v2 while delivering 1.8×–2.1× higher throughput, and in high‑acceptance regimes it can achieve speedups beyond 3×. Quality metrics such as perplexity and BLEU remain on par with or slightly better than the baseline. Training efficiency improves markedly: total training time is reduced by 30%–40% for the same data volume, and peak memory consumption drops to less than 20% of that required by self‑attention SD models. Moreover, the removal of auxiliary layers simplifies the codebase and eases integration with existing SD frameworks like SGLang.

The contributions of the paper are threefold: (1) Demonstrating that a minimalist cross‑attention decoder can attain competitive SD performance without any pooling or fusion modules; (2) Proposing a two‑stage block‑attention training scheme that guarantees stable convergence and constant memory usage; (3) Showing that the simplified architecture yields superior training efficiency and comparable inference speedups, thereby offering a more practical and portable alternative for speculative decoding.

Future work suggested includes scaling the number of attention heads, exploring multimodal cross‑attention for code or image captioning, extending the constant‑memory training tricks to models with tens of billions of parameters, and conducting real‑world latency and cost analyses in production environments. Beagle thus establishes a new paradigm that balances architectural simplicity with high‑performance speculative decoding, potentially reshaping how SD is adopted across diverse LLM ecosystems.

