DeDPO: Debiased Direct Preference Optimization for Diffusion Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework that augments limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.


💡 Research Summary

Direct Preference Optimization (DPO) has become a popular method for aligning text‑to‑image diffusion models with human preferences because it avoids training a separate reward model. However, DPO’s scalability is limited by the need for large amounts of high‑quality human‑annotated preference pairs, which are expensive to collect. In this paper the authors introduce DeDPO, a semi‑supervised framework that augments a small set of human‑labeled pairs with a much larger pool of unlabeled image‑text pairs that are annotated automatically using cheap synthetic AI feedback.
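To make the later debiasing construction concrete, it helps to see DPO's per-pair objective as a binary classification loss on an implicit reward margin. The sketch below is a minimal illustration in that spirit, not the paper's exact diffusion-model formulation; the log policy/reference ratios are assumed to be precomputed for the winning and losing samples.

```python
import math

def dpo_loss(logratio_win, logratio_lose, beta=1.0):
    """Per-pair DPO loss viewed as binary classification.

    logratio_win / logratio_lose: log(pi_theta / pi_ref) for the
    preferred and dispreferred sample (assumed precomputed).
    The label effectively says "the winner's implicit reward
    (beta * log-ratio) should be larger", and the loss is the
    cross-entropy of sigmoid(margin) against that label.
    """
    margin = beta * (logratio_win - logratio_lose)
    # -log sigmoid(margin): small when the model ranks the pair correctly
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2 (a coin-flip classifier), and it shrinks as the model assigns the preferred sample a higher implicit reward, which is exactly the binary-classification view the debiased estimator builds on.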

The key technical contribution is the integration of a doubly‑robust (debiasing) estimator, originally from causal inference, into the DPO loss. By re‑casting DPO as a binary classification problem, the authors propose a loss consisting of three terms: (1) the standard cross‑entropy on the human‑labeled data, (2) a cross‑entropy on the synthetic labels, and (3) a correction term that adjusts the synthetic label toward the true human label for the labeled examples. The correction term is weighted by ((n_l+n_u)/n_l), which amplifies its effect when human labels are scarce. They prove that the expectation of this combined loss equals the original DPO loss regardless of the quality of the synthetic labels, guaranteeing an unbiased estimator of the true objective. When synthetic labels are perfect the loss reduces to using all data; when they are completely wrong it falls back to using only the human data.

Two practical ways to generate synthetic preferences are explored. The first uses pretrained vision‑language models such as CLIP or aesthetic scoring networks to produce fixed pseudo‑labels for the unlabeled pool. The second adopts a self‑training scheme: the model from the previous training iteration supplies pseudo‑labels for the current iteration, with a confidence threshold that discards uncertain predictions. This self‑training approach allows the model to bootstrap from its own improving policy while limiting noise propagation.
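The self-training labeler can be sketched as a simple thresholded decision: the previous iteration's model scores each candidate pair, and pairs whose preference probability is too close to chance are discarded. The threshold `tau` below is an illustrative hyperparameter, not a value from the paper.

```python
def self_train_label(p_prev, tau=0.8):
    """Pseudo-label one pair using the previous iteration's model.

    p_prev: previous model's probability that sample A beats sample B.
    Returns 1 (A preferred), 0 (B preferred), or None to discard the
    pair when the prediction is not confident enough, limiting noise
    propagation across self-training rounds.
    """
    if p_prev >= tau:
        return 1
    if p_prev <= 1.0 - tau:
        return 0
    return None  # uncertain prediction: drop from the unlabeled pool
```

In a full pipeline this would be applied to the unlabeled pool at the start of each self-training round, with the surviving pseudo-labeled pairs fed into the debiased objective as synthetic labels.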

The authors provide a theoretical analysis of convergence, showing that even if the synthetic label model converges slowly, the overall parameter update for the diffusion model remains robust. Empirically, DeDPO is evaluated on a subset of LAION‑5B where only 1 % (or less) of the pairs have human preferences. Across multiple synthetic feedback sources, DeDPO matches or exceeds the performance of a fully supervised DPO trained on all human labels. Metrics include CLIPScore for text‑image alignment, aesthetic scores, and human preference surveys. Notably, when human labels are reduced to 0.5 % of the dataset, DeDPO still retains most of the alignment quality, whereas vanilla DPO degrades sharply.

Compared with prior robust DPO variants (label smoothing, distributionally robust optimization) and reward‑model‑based pipelines, DeDPO does not assume a specific noise model, retains a fully offline training pipeline, and leverages the small high‑quality human set as a bias‑correction reference rather than merely a regularizer. The paper also discusses limitations: heavily biased synthetic annotators could still mislead training, and the current experiments focus on image generation; extending to video, 3D, or multimodal generation remains future work.

In summary, DeDPO demonstrates that a principled debiasing technique can make diffusion model alignment both cost‑effective and robust to noisy synthetic supervision, opening a scalable path toward large‑scale human‑AI alignment without the prohibitive expense of massive human preference datasets.

