SP²DPO: An LLM-assisted Semantic Per-Pair DPO Generalization

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model using a single global temperature β, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, or instruction violations) with low-signal or subjective distinctions (for example, style), and also include label noise. We introduce SP²DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule βᵢ precomputed offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an auditable βᵢ artifact, and incur zero training-time overhead: the inner-loop optimizer remains standard DPO with β set per pair. We focus our empirical study on AlpacaEval 2.0, reporting both raw win rate and length-controlled win rate. Across four open-weight, instruction-tuned student backbones (4B-8B), SP²DPO is competitive with a tuned global-β DPO baseline and improves AlpacaEval 2.0 length-controlled win rate on two of four backbones, while avoiding per-model β sweeps. All code, annotations, and artifacts will be released.


💡 Research Summary

Direct Preference Optimization (DPO) has emerged as a stable, reward‑model‑free alternative to traditional RLHF, directly maximizing the likelihood of preferred completions under a Bradley‑Terry formulation. The method relies on a single scalar temperature β to balance the strength of preference enforcement against a KL‑regularized pull toward a fixed reference policy. Implicitly, this treats every preference pair as equally informative, assuming a uniform signal‑to‑noise ratio across the dataset. In practice, real‑world feedback corpora are highly heterogeneous: they contain high‑signal, objective failures such as safety violations, factual hallucinations, or instruction breaches, mixed with low‑signal, subjective distinctions like style or tone, and they inevitably contain noisy or contradictory labels.
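The standard per-pair DPO objective described above can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation; in practice the log-probabilities are summed token log-likelihoods computed by the policy and the frozen reference model:

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (x, y_w, y_l).

    Each argument is the summed log-probability of the chosen (w) or
    rejected (l) completion under the policy or the frozen reference
    model. The single scalar `beta` scales the implicit reward margin,
    trading preference enforcement against the KL pull toward the
    reference policy.
    """
    # Implicit reward margin: how much more the policy prefers y_w over
    # y_l, relative to the reference model.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    # Bradley-Terry negative log-likelihood: -log sigmoid(beta * margin).
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy agrees exactly with the reference (zero margin), the loss is log 2 ≈ 0.693; a larger positive margin drives it toward zero.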

The paper introduces SP²DPO (Semantic Per‑Pair DPO), a generalization that replaces the global β with an instance‑specific temperature schedule {βᵢ}. The key insight is to move the control of enforcement strength from an online hyper‑parameter to an offline, auditable data artifact. Strong teacher LLMs are employed to annotate each preference triplet (x, y_w, y_l) with a structured “semantic gap” tuple: a dominant category (Safety, Factuality, Instruction, Reasoning, Helpfulness, Style), a magnitude score in

