Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach can be further generalized: the original problem remains tractable even if the KL divergence is replaced by any $f$-divergence with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of both the winner and the loser responses approach zero. We refer to any $f$ that satisfies this condition as displacement-resistant. Finally, we focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
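For reference, the KL-regularized RLHF objective described above and the resulting DPO loss can be written in standard notation (a sketch; $\beta$ denotes the regularization strength, $y_w$ and $y_l$ the winner and loser responses):

$$\max_{\pi_\theta}\ \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big),$$

$$\mathcal{L}_{\mathrm{DPO}} \;=\; -\,\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right].$$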


💡 Research Summary

The paper revisits Direct Preference Optimization (DPO), a method that aligns language models by directly maximizing a Bradley‑Terry (BT) reward while penalizing deviation from a reference policy using a KL‑divergence term. Prior work showed that the KL can be replaced by any convex f‑divergence whose generator f is differentiable with an invertible derivative, preserving a closed‑form solution and yielding an “f‑DPO” loss.
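In standard notation, the $f$-divergence generalization replaces the KL term by $D_f(\pi_\theta\,\|\,\pi_{\mathrm{ref}}) = \mathbb{E}_{y\sim\pi_{\mathrm{ref}}}\!\left[f\!\left(\pi_\theta(y\mid x)/\pi_{\mathrm{ref}}(y\mid x)\right)\right]$, and when $f$ is differentiable with an invertible derivative, the resulting $f$-DPO loss takes the form (a sketch of the prior result):

$$\mathcal{L}_{f\text{-DPO}} \;=\; -\,\mathbb{E}\!\left[\log\sigma\!\left(\beta\, f'\!\left(\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}\right) \;-\; \beta\, f'\!\left(\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right)\right],$$

which recovers standard DPO for the reverse-KL generator $f(t) = t\log t$, since then $f'(t) = \log t + 1$ and the constants cancel in the difference.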

The authors first demonstrate that convexity of f is not required for tractability. They introduce the notion of a DPO‑inducing function: a generator f for which substituting the optimal solution of the generalized RLHF objective

$$\max_{\pi_\theta}\ \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\, D_f\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)$$

into the Bradley‑Terry model yields a tractable, reward‑free training loss.
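The generalized loss described above can be sketched in plain Python for a single preference pair. This is an illustrative implementation, not the paper's code: `f_prime` is a pluggable derivative of the generator f, and the specific numbers in the comments are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def f_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta, f_prime):
    """Generalized DPO-style loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of winner / loser responses.
    ref_logp_w / ref_logp_l: reference-policy log-probabilities.
    f_prime: derivative of the divergence generator f, applied to the
    policy/reference probability ratios.
    """
    ratio_w = math.exp(logp_w - ref_logp_w)  # pi_theta(y_w) / pi_ref(y_w)
    ratio_l = math.exp(logp_l - ref_logp_l)  # pi_theta(y_l) / pi_ref(y_l)
    margin = beta * (f_prime(ratio_w) - f_prime(ratio_l))
    return -math.log(sigmoid(margin))

# Reverse-KL generator f(t) = t*log(t) gives f'(t) = log(t) + 1; with this
# choice the loss reduces to standard DPO, since the additive constant
# cancels in the difference and f'(ratio) - 1 equals the log-ratio.
kl_fprime = lambda t: math.log(t) + 1.0
```

As a quick check, with `kl_fprime` the loss equals the standard DPO loss computed directly from the log-ratios, so the generalized form is a strict superset of DPO.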

