Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm that mitigates the well-known over-optimization issue in preference learning without requiring knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets, which it aggregates through a worst-case construction that favors agreement across models. In the tabular setting, PEPO achieves sample-complexity guarantees that depend only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability that affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by convincing practical performance, while retaining the simplicity and practicality of DPO-style training.
💡 Research Summary
The paper tackles a pressing problem in the alignment of large language models (LLMs): Direct Preference Optimization (DPO), while simple and empirically strong, suffers from over‑optimization when the preference dataset is finite. Over‑optimization manifests as continued reduction of the training loss without corresponding improvements in generated text quality, and sometimes even degradation. Existing “pessimistic” variants such as χ²‑PO or DPO+SFT mitigate this issue, but they rely on knowledge of the data‑generating distribution π_data or on a well‑covered reference policy π_base—assumptions that often fail in practice because preference data may be produced by proprietary models whose logits are unavailable.
The authors answer the open question “Can we design a DPO‑like algorithm that provably avoids over‑optimization without any knowledge of π_data?” by introducing PEPO (Pessimistic Ensemble based Preference Optimization). PEPO combines three key ideas:
- Pessimistic Loss Construction – Instead of the standard sigmoid σ used in DPO, PEPO employs a right‑shifted “pessimistic” sigmoid σ_pess(x,λ) = σ(x − log(1 + λe^{x/2})). This function originates from a Bradley‑Terry model with ties, where λ is a tie weight that can depend on the prompt‑answer triplet. The shift effectively lowers the estimated win probability, guarding against optimistic over‑estimation of the unknown reward.
- Ensemble over Disjoint Data Subsets – The full preference dataset D is randomly partitioned into L equal‑size, non‑overlapping subsets {D₁,…,D_L}. For each subset, a lightweight LoRA‑adapted policy π̃_ℓ is trained by maximizing a pessimistic DPO objective L_pessDPO(π;D_ℓ) that replaces the ordinary sigmoid with σ_pess. Because each policy sees only a distinct slice of the data, no single policy can over‑fit the entire dataset.
- Worst‑Case Aggregation – After training, the final policy π_out is defined by a minimum‑over‑ensemble rule: for each action a and prompt x, π_out(a|x) ∝ min_ℓ π̃_ℓ(a|x)·exp(−B·p_tie(x,a)/β), where p_tie is an upper bound on the tie probability. This construction retains only those actions to which all ensemble members assign relatively high probability, thereby implementing a pessimistic “worst‑case” selection.
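The three components above can be sketched numerically for a small tabular problem. This is a minimal illustration under our own assumptions (data structures are hypothetical, and the final normalization in the aggregation step is only for inspection, since the paper draws from the unnormalized scores via rejection sampling); it is not the paper's implementation:

```python
import math
import random

def sigma(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigma_pess(x, lam):
    """Pessimistic sigmoid sigma(x - log(1 + lam * e^{x/2})).

    At lam = 0 it reduces to the ordinary sigmoid; a larger tie weight
    shifts the curve right, lowering the estimated win probability."""
    return sigma(x - math.log(1.0 + lam * math.exp(x / 2.0)))

def partition(dataset, L, seed=0):
    """Randomly split `dataset` into L disjoint, near-equal-size subsets."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    return [data[i::L] for i in range(L)]

def aggregate_worst_case(ensemble_probs, p_tie, B, beta):
    """Min-over-ensemble aggregation with the tie-probability penalty.

    ensemble_probs: list of dicts (one per policy), action -> probability.
    p_tie: dict, action -> upper bound on the tie probability.
    Returns pessimistic scores normalized over actions (for inspection only)."""
    scores = {}
    for a in ensemble_probs[0]:
        worst = min(p[a] for p in ensemble_probs)
        scores[a] = worst * math.exp(-B * p_tie[a] / beta)
    Z = sum(scores.values())
    return {a: s / Z for a, s in scores.items()}
```

Note how the min operation penalizes any action on which the ensemble members disagree: it is kept only if every policy, each fit on its own data slice, assigns it non-negligible probability.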
Theoretical contributions focus on a tabular Markov Decision Process (MDP) setting. The authors prove that PEPO’s sample complexity depends only on a single‑policy concentrability coefficient C_π, yielding a bound of order O(C_π·log|Π|/ε²). This is a substantial improvement over standard DPO, which requires an all‑policy concentrability term that can be arbitrarily large when the dataset does not cover the whole policy class. Crucially, the analysis makes no assumption about π_data and does not require learning an explicit reward model.
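To make the comparison concrete, a common form of the single-policy concentrability coefficient in offline RL (an assumption for illustration; the paper's exact definition may differ in detail) bounds the density ratio between a single comparator policy π* and the data distribution, and the bound above then reads:

```latex
% One standard definition (illustrative; the paper's may differ):
C_{\pi^\star} \;=\; \max_{x,\,a}\; \frac{\pi^\star(a \mid x)}{\pi_{\mathrm{data}}(a \mid x)},
\qquad
N \;=\; O\!\left(\frac{C_{\pi^\star}\,\log|\Pi|}{\varepsilon^2}\right).
```

An all-policy coefficient instead takes the maximum over every π ∈ Π, which can blow up whenever some policy in the class concentrates on actions the dataset barely covers.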
Empirically, PEPO is evaluated on several open‑source LLMs ranging from 7B to 34B parameters (Zephyr‑7B, Llama‑3.1‑8B, Mistral‑7B, Yi‑34B). The authors compare PEPO against vanilla DPO using human evaluations, automatic metrics (ROUGE, BLEU), and a diagnostic over‑optimization metric (training loss vs. quality). Across all model sizes, PEPO consistently reduces signs of over‑optimization and improves final quality by 3–7% relative to DPO. The ensemble size L required for strong performance is modest (e.g., L = 5), and the proposed rejection‑sampling scheme for drawing from π_out adds less than 10% overhead to inference time.
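The rejection-sampling scheme mentioned above lets one draw from π_out without computing its normalizer. One standard way to realize this (a sketch under our own assumptions; the paper's exact scheme may differ) is to propose from the first ensemble member and accept with the ratio of the unnormalized pessimistic score to the proposal probability, which is always at most 1 because the min over the ensemble cannot exceed the proposer's own probability and the exponential penalty is at most 1:

```python
import math
import random

def rejection_sample(propose, ensemble_probs, p_tie, B, beta, x, max_tries=10000):
    """Draw an action approximately from pi_out(.|x) without normalizing.

    propose(x)           -> an action sampled from the first ensemble member.
    ensemble_probs(x, a) -> [pi_1(a|x), ..., pi_L(a|x)].
    p_tie(x, a)          -> upper bound on the tie probability.
    The acceptance ratio f(a)/q(a) is <= 1 since min_l pi_l(a|x) <= pi_1(a|x)."""
    for _ in range(max_tries):
        a = propose(x)
        probs = ensemble_probs(x, a)
        f = min(probs) * math.exp(-B * p_tie(x, a) / beta)
        if random.random() <= f / probs[0]:
            return a
    return a  # fallback after max_tries (our safeguard, not from the paper)
```

Because the proposer is itself an ensemble member, the acceptance rate stays high whenever the ensemble policies roughly agree, which is consistent with the reported small inference overhead.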
In summary, PEPO delivers four major contributions:
- Distribution‑Free Guarantees – avoids any reliance on the unknown data‑generating distribution.
- Pessimistic Ensemble Mechanism – combines a shifted sigmoid loss with a worst‑case aggregation to curb over‑optimization.
- Single‑Policy Concentrability Theory – provides provable sample‑complexity bounds under far weaker assumptions than prior DPO analyses.
- Practical Effectiveness – demonstrates consistent improvements on real‑world LLM fine‑tuning tasks with negligible computational cost.
The work opens a promising direction for safe preference‑based fine‑tuning of LLMs when the preference data originates from opaque or proprietary sources, offering both theoretical rigor and practical viability.