Accelerating Large Language Model Inference with Self-Supervised Early Exits
This paper presents a modular approach to accelerating inference in large language models (LLMs) by attaching early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model’s predictions, allowing computation to stop early once a calibrated confidence threshold is met. We evaluate several confidence metrics and show that entropy provides the most reliable separation between correct and incorrect predictions. Experiments on the Pythia model suite (70M to 2.8B parameters) demonstrate that our method significantly reduces inference cost while maintaining accuracy across multiple benchmarks. We further adapt this approach to speculative decoding, introducing Dynamic Self-Speculative Decoding (DSSD), which achieves up to 1.66x higher token acceptance than manually-tuned LayerSkip baselines with minimal hyperparameter tuning.
💡 Research Summary
This paper presents a novel and modular framework for accelerating the inference of large language models (LLMs) by integrating “early exit” mechanisms. The core idea addresses the inherent inefficiency of standard autoregressive generation, where computationally expensive full forward passes are performed for every token, regardless of its predictability. The proposed method attaches lightweight early exit heads (multi-layer perceptrons) to intermediate layers of a pre-trained transformer model (e.g., at the 1/5, 2/5, 3/5, and 4/5 points of the total layers). These auxiliary heads are trained in a self-supervised fashion to mimic the probability distribution of the final model output, using a Kullback-Leibler divergence loss. This training utilizes the same data distribution (MiniPile) as the original model pre-training, avoiding distribution shift issues and requiring no additional labeled data.
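The training objective described above can be illustrated with a minimal numpy sketch. All function and variable names here are hypothetical (the paper describes MLP heads trained with a KL divergence loss against the frozen model's final output, but does not publish this code); a real implementation would use an autodiff framework rather than raw numpy:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): how far the head's distribution q is from the
    # frozen final-layer "teacher" distribution p.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def exit_head_loss(hidden, W1, b1, W2, b2, teacher_logits):
    # Lightweight two-layer MLP head on an intermediate hidden state
    # (shapes are illustrative assumptions).
    h = np.maximum(0.0, hidden @ W1 + b1)   # ReLU
    head_logits = h @ W2 + b2
    p_teacher = softmax(teacher_logits)     # final-layer output, kept frozen
    q_head = softmax(head_logits)
    return kl_divergence(p_teacher, q_head).mean()
```

Because the target is the model's own output distribution on the pre-training data (MiniPile), no labels are needed and the backbone stays frozen; only the head parameters (`W1`, `b1`, `W2`, `b2` here) would receive gradients.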
A critical contribution is the rigorous calibration process for making early exit decisions. The authors evaluate three confidence metrics—maximum probability, entropy, and top-2 probability difference—for determining when an early exit prediction is reliable. Through extensive analysis on the Pythia model suite (70M to 2.8B parameters), they consistently demonstrate that entropy provides the best separation between correct and incorrect predictions across all model sizes, as evidenced by the highest Area Under the Curve (AUC) in Receiver Operating Characteristic (ROC) analysis. A user-defined target accuracy level (ε) is then used on a held-out calibration set to determine a per-head entropy threshold. During inference, for each token, the model computes the output and entropy at each early exit layer. If the entropy falls below the calibrated threshold for that layer (indicating high confidence), the inference halts, and that prediction is used as the final output. Otherwise, computation proceeds to the next layer or the final model output.
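The calibration step can be sketched as follows. This is an assumed realization (the paper specifies only that a per-head entropy threshold is derived from a held-out set to meet the target accuracy ε; the particular prefix search below is one plausible way to do that):

```python
import numpy as np

def entropy(probs, eps=1e-12):
    # Shannon entropy of a head's output distribution; low entropy = confident.
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def calibrate_threshold(entropies, correct, epsilon):
    """Pick the largest entropy threshold tau such that, among calibration
    tokens that would exit at this head (entropy <= tau), the fraction of
    correct predictions is at least epsilon."""
    order = np.argsort(entropies)              # most confident tokens first
    sorted_correct = correct[order]
    # Accuracy of each confidence-ordered prefix of the calibration set.
    cum_acc = np.cumsum(sorted_correct) / np.arange(1, len(correct) + 1)
    ok = np.where(cum_acc >= epsilon)[0]
    if len(ok) == 0:
        return -np.inf                         # never exit at this head
    return entropies[order][ok.max()]          # largest prefix meeting target

def should_exit(probs, tau):
    # Inference-time decision: halt at this head if entropy clears threshold.
    return entropy(probs) <= tau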
The paper further innovates by adapting this early exit paradigm to speculative decoding, a technique that uses a small draft model to propose candidate token sequences which are then verified in parallel by the larger target model. The authors introduce Dynamic Self-Speculative Decoding (DSSD). Unlike prior work like LayerSkip, which manually selects a fixed intermediate layer as the draft model and requires extensive hyperparameter search for the optimal speculation length, DSSD dynamically decides the exit point (i.e., the draft model depth) for each token based on its real-time prediction confidence (entropy). Easy-to-predict tokens exit at shallower layers, acting as draft tokens, while harder tokens trigger deeper computation. This adaptive process is governed solely by the same accuracy threshold ε used in calibration.
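The drafting side of DSSD can be sketched schematically. The interfaces below are hypothetical stand-ins (each `head` abstracts "run the backbone up to an exit layer and return a token plus its entropy"; the subsequent parallel verification by the full model is omitted):

```python
def draft_tokens(prompt, exit_heads, thresholds, max_draft):
    """Generate draft tokens by exiting at the shallowest head whose entropy
    clears its calibrated threshold; stop drafting once no head is confident
    enough, so that token is left to the full model."""
    drafts = []
    context = list(prompt)
    for _ in range(max_draft):
        exited = False
        for head, tau in zip(exit_heads, thresholds):
            token, ent = head(context)   # head: context -> (token, entropy)
            if ent <= tau:               # easy token: exit here, keep drafting
                drafts.append(token)
                context.append(token)
                exited = True
                break
        if not exited:                   # hard token: defer to deeper layers
            break
    return drafts
```

Note the contrast with LayerSkip: the draft depth is not a fixed layer chosen by hyperparameter search but is decided per token by the same entropy thresholds calibrated from ε, so easy tokens draft cheaply and hard tokens fall through to deeper computation.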
Experimental results validate the effectiveness of both components. The basic early exit framework achieves significant reductions in computational cost (measured in FLOPs or layer traversals) while maintaining accuracy close to the base model across multiple benchmarks. The DSSD method showcases superior performance in the speculative decoding context. Compared to manually-tuned LayerSkip baselines, DSSD achieves up to 1.66x higher token acceptance rates with minimal hyperparameter tuning (only the ε threshold). This translates to generating more verified tokens per unit of computational effort. The work concludes that the proposed self-supervised early exit approach provides a practical, plug-and-play method for accelerating LLM inference. It leverages the model’s own knowledge, requires no architectural changes to the frozen backbone, and offers a principled trade-off between speed and accuracy through calibrated confidence thresholds. The extension to DSSD presents a particularly efficient and adaptive strategy for speculative decoding, promising faster generation times for large language models.