Temporal Guidance for Large Language Models


Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.


💡 Research Summary

The paper tackles the inefficiencies of conventional Contrastive Decoding (CD), which requires a separate “amateur” model alongside a large “expert” model, leading to doubled KV‑cache usage, higher memory consumption, and increased latency. While internal contrastive methods such as Decoding by Contrasting Layers (DoLa) avoid a second model by treating shallow layers as amateurs, they suffer from instability on smaller architectures because shallow layers do not reliably produce weak signals.

The authors observe that large language models (LLMs) exhibit a strong locality bias: the most recent tokens dominate the prediction of the next token. Removing these recent tokens forces the model to rely on more generic, high‑entropy patterns, naturally creating a weaker distribution. Leveraging this, they propose Temporal Guidance (TeGu), a decoding‑time algorithm that contrasts the standard next‑token prediction (NTP) – the “expert” – with predictions from Multi‑Token Prediction (MTP) heads that are conditioned on a historical context lacking the last k tokens – the “amateur”.

Formally, the expert distribution is \(P_{\text{exp}}(x_t)=P(x_t\mid x_{<t})\). For each offset \(k\) in a set \(K\), an amateur distribution \(P(x_t\mid x_{<t-k})\) is computed. These are combined with manually set weights \(w_k\) into a weighted mixture \(P_{\text{amt}}(x_t)=\sum_{k\in K} w_k\,P(x_t\mid x_{<t-k})\), evaluated in log space with the Log‑Sum‑Exp trick for numerical stability. The final guided log‑probability follows the classic CD formula:

\(\log P_{\text{guided}}(x_t) = (1+\gamma)\,\log P_{\text{exp}}(x_t) - \gamma\,\log P_{\text{amt}}(x_t),\)

where \(\gamma\) controls the guidance strength.
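The combination above can be sketched numerically. This is a minimal illustration, not the paper's implementation: the function name `temporal_guidance` and the guidance-strength parameter `gamma` are assumptions for exposition, and the amateur log-probabilities are assumed to come from MTP heads conditioned on the truncated context.

```python
import numpy as np

def log_sum_exp(a, axis=0):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def temporal_guidance(expert_logp, amateur_logps, weights, gamma=0.5):
    """Hypothetical sketch of the TeGu combination step.

    expert_logp:   (V,)   log-probs from standard next-token prediction
    amateur_logps: (|K|, V) log-probs from MTP heads at offsets k in K
    weights:       (|K|,) mixture weights w_k (assumed to sum to 1)
    gamma:         guidance strength (illustrative parameter)
    """
    # log P_amt = log sum_k w_k * P_k, computed stably via log-sum-exp
    log_amt = log_sum_exp(np.log(weights)[:, None] + amateur_logps, axis=0)
    # classic contrastive-decoding combination of expert and amateur
    scores = (1 + gamma) * expert_logp - gamma * log_amt
    # renormalize so the result is a proper log-distribution
    return scores - log_sum_exp(scores, axis=0)
```

Note that with `gamma=0` the function returns the unmodified expert distribution, which makes the guidance strength easy to ablate.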

