Autoregressive Direct Preference Optimization

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original ArXiv source.

Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit its full potential, as the reference and learnable models are assumed to be autoregressive only after deriving the objective function. Motivated by this limitation, we revisit the theoretical foundations of DPO and propose a novel formulation that explicitly introduces the autoregressive assumption prior to applying the BT model. By reformulating and extending DPO, we derive a novel variant, termed Autoregressive DPO (ADPO), that explicitly integrates autoregressive modeling into the preference optimization framework. Without violating the theoretical foundations, the derived loss takes an elegant form: it shifts the summation operation in the DPO objective outside the log-sigmoid function. Furthermore, through theoretical analysis of ADPO, we show that there exist two length measures to be considered when designing DPO-based algorithms: the token length $μ$ and the feedback length $μ'$. To the best of our knowledge, we are the first to explicitly distinguish these two measures and analyze their implications for preference optimization in LLMs.


💡 Research Summary

The paper revisits Direct Preference Optimization (DPO), a recent method for aligning large language models (LLMs) with human preferences without an explicit reward model. While DPO enjoys theoretical elegance and computational efficiency, it inherits a subtle inconsistency: the learned model πθ is assumed to be autoregressive (i.e., factorized token‑by‑token), yet the Boltzmann distribution p₂ that underlies the DPO loss is defined over complete responses. Consequently, the Bradley‑Terry (BT) preference model is applied only at the level of whole sequences, ignoring the autoregressive nature of modern LLMs.
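The sequence-level view described above can be made concrete with a minimal sketch of the standard DPO loss for a single preference pair. The function name and signature are illustrative (not from the paper), and the inputs are assumed to be log-probabilities already summed over tokens:

```python
import math

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """Sequence-level DPO loss for one preference pair.

    Inputs are total log-probabilities (summed over all tokens) of the
    chosen (w) and rejected (l) responses under the learnable policy
    and the frozen reference model.
    """
    # Implicit reward of each response: beta-scaled log-ratio vs. the reference.
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    # The Bradley-Terry model is applied once, over complete responses:
    # a single log-sigmoid of the response-level margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2 ≈ 0.693, and it decreases monotonically as the policy favors the chosen response more strongly than the reference does; note that the BT comparison happens only once per pair, which is exactly the whole-sequence treatment the paper points out.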

To resolve this mismatch, the authors propose Autoregressive Direct Preference Optimization (ADPO). The key idea is to shift the autoregressive assumption from the model side to the preference model itself by defining a prefix closure Y* of the output space Y. For each prefix y≤i (the first i tokens of a response), a “prefix‑wise” reward r(x, y≤i) is introduced. Two new energy functions are defined over these prefixes:

  • E₁(x, y≤i) = −r(x, y≤i) (likelihood energy)
  • E₂(x, y≤i) = −(1/β) r(x, y≤i) − log π_ref(y_i | y_{<i}, x) (posterior energy)

Because the reference model π_ref is already autoregressive, the resulting Boltzmann distribution p₂ becomes a product of token‑level factors, i.e., p₂(y|x) = ∏_{i=1}^{T′} p₂(y_i | y_{<i}, x). The BT preference model is also applied token‑wise, yielding a “prefix‑wise BT” distribution.
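The prefix-wise BT idea can be sketched numerically. The per-prefix rewards below are hypothetical values chosen for illustration (not from the paper), and the chosen/rejected responses are assumed to be compared position-by-position at equal length:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical prefix-wise rewards r(x, y<=i) for a chosen (w) and a
# rejected (l) response; purely illustrative numbers.
r_w = [0.4, 0.9, 1.3]
r_l = [0.2, 0.1, -0.5]

# Prefix-wise BT: a Bradley-Terry comparison at every prefix y<=i,
# instead of a single comparison on the complete responses.
prefix_probs = [sigmoid(rw - rl) for rw, rl in zip(r_w, r_l)]

# Log-probability of the preferred-over-dispreferred event under the
# prefix-wise BT model: a sum of per-prefix log-sigmoids.
log_pref = sum(math.log(p) for p in prefix_probs)
```

Each factor is a valid probability in (0, 1), so the event's log-probability decomposes additively over prefixes, which is what makes the resulting loss token-wise.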

The ADPO loss, derived by maximizing the log‑probability of the preferred‑over‑dispreferred event under this prefix‑wise BT model, takes the form

L_ADPO = − E_{(x, y_w, y_l)∼D} [ Σ_{i=1}^{T′} log σ( β log (π_θ(y_{w,i} | y_{w,<i}, x) / π_ref(y_{w,i} | y_{w,<i}, x)) − β log (π_θ(y_{l,i} | y_{l,<i}, x) / π_ref(y_{l,i} | y_{l,<i}, x)) ) ]

In contrast to standard DPO, where the log‑sigmoid wraps the summed per‑token log‑ratios, here the summation sits outside the log‑sigmoid.
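The structural difference between the two objectives, namely where the summation sits relative to the log-sigmoid, can be sketched as follows. Both functions consume per-token log-ratios log(π_θ/π_ref); the sketch assumes the two responses are paired position-by-position at equal length T′, and the function names are illustrative:

```python
import math

def logsigmoid(z):
    # log sigma(z); stable enough for the moderate values used here
    return -math.log1p(math.exp(-z))

def dpo_loss(lr_w, lr_l, beta=0.1):
    """Standard DPO: sum the per-token log-ratios first, then apply one
    log-sigmoid to the response-level margin."""
    return -logsigmoid(beta * (sum(lr_w) - sum(lr_l)))

def adpo_loss(lr_w, lr_l, beta=0.1):
    """ADPO (as described above): one log-sigmoid per token position,
    summed afterwards, i.e., the sum sits outside the log-sigmoid."""
    return -sum(logsigmoid(beta * (w - l)) for w, l in zip(lr_w, lr_l))
```

With all-zero margins over T′ tokens, DPO yields log 2 while ADPO yields T′·log 2: one BT comparison per token rather than one per response, which is why the token length becomes a first-class quantity in the paper's analysis.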

