Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the quality of the final policies. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.


💡 Research Summary

This paper provides a fine‑grained theoretical investigation of why reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) can exhibit different performance in practice. The authors decompose the performance gap into two sources: an explicit representation gap that appears when the model classes for the reward function and the policy are misspecified, and an implicit representation gap that arises from finite‑sample estimation error.
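For reference, the DPO objective that the paper analyzes can be sketched for a single preference pair as follows (a minimal illustration; the variable names and the choice of β are ours, not the paper's):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen (w) and
    rejected (l) responses; ref_logp_w / ref_logp_l: the same quantities
    under the frozen reference policy. The implicit reward of a response
    is beta * (log pi(y) - log pi_ref(y)), so DPO fits the policy and the
    reward in one stage.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the implicit reward margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss decreases as the policy assigns a larger implicit-reward margin to the preferred response; with no margin it equals log 2.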

Exact optimization (infinite data).
Four model‑mis‑specification regimes are examined.

  1. Strong reward and strong policy (realizable case). When both the true reward r* and the optimal policy π* lie inside the respective model classes, RLHF and DPO achieve the same optimal value V*; the only difference is in convergence speed and trajectory.
  2. Strong reward, weak policy. The reward class can represent r* exactly but the policy class cannot represent π*. RLHF first learns the exact reward and then computes the best possible policy within the limited policy class, thus attaining a higher value than DPO, which directly fits a policy to preferences and suffers from a structural mismatch. The paper constructs concrete environments where V_RLHF > V_DPO and shows that even online DPO with sophisticated samplers cannot close the gap.
  3. Weak reward, strong policy. The policy class is expressive enough to contain π* but the reward class cannot represent r*. Here RLHF’s two‑stage pipeline suffers because the learned reward model is biased, leading to sub‑optimal policies, while DPO bypasses reward learning and can recover π* directly, achieving a higher value.
  4. Both classes misspecified but isomorphic. When the reward and policy model families are structurally identical (isomorphic) yet both are limited, an online variant of DPO equipped with a PILAF sampler and adaptive step sizes can outperform both standard RLHF and offline DPO. This result highlights that online DPO can simultaneously mitigate reward‑model error and exploit the alignment between the two model families.

Approximate optimization (finite samples).
The authors then turn to statistical efficiency. They consider a linear reward setting where the true reward vector has dimension d and sparsity k. Under this construction, RLHF learns the reward via maximum‑likelihood estimation; because the reward is sparse, standard sparse‑regression techniques give an estimation error of order $\tilde O(k\log d / n)$. In contrast, DPO implicitly estimates the reward while fitting the policy, and the error scales as $\Omega(\sqrt{d/n})$, which does not benefit from sparsity. Consequently, for the same number of preference pairs n, RLHF can recover an effective reward model with far fewer samples than DPO, leading to a strictly better expected bandit value. The paper validates this separation with controlled experiments on synthetic data and small language models, confirming that RLHF attains higher policy quality with limited data while DPO lags behind.
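The flavor of this separation can be illustrated with a toy Gaussian sequence model (our own construction for illustration, not the paper's exact setup): an explicit estimator that exploits sparsity via thresholding beats a dense estimator at the same sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 400, 5, 50                    # ambient dimension, sparsity, samples
theta = np.zeros(d)
theta[:k] = 1.0                         # implicitly sparse "reward" vector

# n noisy observations of theta, standing in for preference data.
Y = theta + rng.normal(scale=1.0, size=(n, d))

# Dense estimator (no sparsity exploited): error scales like sqrt(d/n).
dense_est = Y.mean(axis=0)

# Soft-thresholding at the universal level sqrt(2 log d / n) exploits
# sparsity, killing most of the d - k pure-noise coordinates.
tau = np.sqrt(2.0 * np.log(d) / n)
sparse_est = np.sign(dense_est) * np.maximum(np.abs(dense_est) - tau, 0.0)

err_dense = np.linalg.norm(dense_est - theta)
err_sparse = np.linalg.norm(sparse_est - theta)
print(f"dense error: {err_dense:.2f}, thresholded error: {err_sparse:.2f}")
```

In the sparse, high-dimensional regime (k ≪ d) the thresholded estimate has a markedly smaller ℓ2 error, mirroring the paper's claim that the explicit two-stage pipeline can exploit reward sparsity while the implicit estimator cannot.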

Practical implications.

  • If the policy model is the bottleneck (e.g., a small decoder) but a high‑capacity reward model is available, RLHF is preferable.
  • If the reward model is the bottleneck (e.g., a tiny reward head) while a large policy model can be trained, DPO (especially its online version) is advantageous.
  • When both models share the same architecture or are isomorphic, online DPO can dominate both approaches.
  • In high‑dimensional, sparse‑reward regimes, the two‑stage RLHF pipeline offers a clear sample‑efficiency advantage.
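The guidance above can be condensed into a rule of thumb (our paraphrase of the paper's four regimes, not code from the paper):

```python
def preferred_method(reward_ok: bool, policy_ok: bool,
                     isomorphic: bool = False) -> str:
    """Pick a preference-learning method from model-class capacities.

    reward_ok / policy_ok: whether the reward / policy model class can
    represent the ground truth (r* / pi*); isomorphic: whether the two
    classes are structurally identical.
    """
    if reward_ok and policy_ok:
        return "either (RLHF and DPO reach the same optimal value)"
    if reward_ok and not policy_ok:
        return "RLHF"
    if policy_ok and not reward_ok:
        return "DPO (or online DPO)"
    # Both classes mis-specified: online DPO helps when they are isomorphic.
    return "online DPO" if isomorphic else "case-by-case"
```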

Overall, the paper delivers a comprehensive taxonomy of when RLHF, DPO, or online DPO should be used, grounded in rigorous analysis of representation capacity and statistical sample complexity. It clarifies that the previously assumed equivalence under realizability does not hold in realistic settings and provides actionable guidance for designing preference‑based fine‑tuning pipelines for large language models.

