Di3PO - Diptych Diffusion DPO for Targeted Improvements in Image Generation


Existing methods for preference tuning of text-to-image (T2I) diffusion models often rely on computationally expensive generation steps to create positive and negative pairs of images. These approaches frequently yield training pairs that lack meaningful differences, are expensive to sample and filter, or exhibit significant variance in irrelevant pixel regions, thereby degrading training efficiency. To address these limitations, we introduce "Di3PO", a novel method for constructing positive and negative pairs that isolates the specific regions targeted for improvement during preference tuning while keeping the surrounding image context stable. We demonstrate the efficacy of our approach on the challenging task of text rendering in diffusion models, showing improvements over supervised fine-tuning (SFT) and DPO baselines.


💡 Research Summary

The paper introduces Di3PO (Diptych Diffusion DPO), a novel approach for constructing high‑quality positive‑negative image pairs that isolate a targeted region—specifically text rendering—while keeping the surrounding context unchanged. Existing preference‑tuning methods for text‑to‑image (T2I) diffusion models, such as Diffusion‑DPO, rely on generating pairs that often differ in background, lighting, or composition. This “visual inconsistency” introduces confounding signals, wastes computational resources, and hampers the model’s ability to learn the precise attribute that should be improved.

Di3PO solves this problem by leveraging diptych prompting, a technique that asks a generative model to produce a two‑panel image where both panels share the exact same background but differ only in a small, controlled way. The authors build a fully automated pipeline: (1) they start from a seed list of correctly spelled words and programmatically create misspelled variants by altering 20 % of the characters; (2) a large language model (Gemini 2.5) generates a rich, diverse description of a background scene for each word pair; (3) the background description is combined with a diptych prompt template that explicitly requests the model to render the correct word in the left panel and the misspelled word in the right panel, guaranteeing that the only substantive difference is the quality of the rendered text; (4) a single wide image containing both panels is generated, then split into two separate images using Canny edge detection (with a fallback to a simple midpoint split).
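Step (1) of the pipeline can be sketched as follows. This is an illustrative implementation only: the paper states that roughly 20 % of a word's characters are altered, but the exact edit operations are not specified, so the substitution-only strategy and the function name `make_misspelling` are assumptions.

```python
import random
import string

def make_misspelling(word, alter_frac=0.2, seed=None):
    """Corrupt roughly `alter_frac` of the characters in `word` by
    substituting random lowercase letters. At least one character is
    always changed so the pair is never identical."""
    rng = random.Random(seed)
    chars = list(word)
    n_alter = max(1, round(len(chars) * alter_frac))
    for i in rng.sample(range(len(chars)), n_alter):
        # Pick a replacement letter guaranteed to differ from the original.
        choices = [c for c in string.ascii_lowercase if c != chars[i].lower()]
        chars[i] = rng.choice(choices)
    return "".join(chars)

# Example: "diffusion" has 9 characters, so round(9 * 0.2) = 2 are altered.
misspelled = make_misspelling("diffusion", seed=0)
```

A fixed `seed` makes the corruption reproducible, which is convenient when regenerating the same word pairs across pipeline runs.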

To ensure the dataset’s reliability, a multimodal verification step is added. Gemini 2.5 is prompted to act as a human rater, checking that (i) the backgrounds of the two images are identical and (ii) the text differs slightly but is present in both images. Only pairs that receive a “pass” decision with a confidence score above a chosen threshold (e.g., 70 %) are retained. Using this pipeline, the authors produce 300 diptych pairs for the text‑rendering task.
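The verification filter described above reduces to a simple predicate over the rater's outputs. The record schema below (`decision`, `confidence`, `pair_id`) is assumed for illustration; the paper does not give the exact format of Gemini 2.5's responses.

```python
from dataclasses import dataclass

@dataclass
class VerdictRecord:
    pair_id: str
    decision: str      # "pass" or "fail" from the multimodal rater
    confidence: float  # rater's self-reported confidence in [0, 1]

def filter_pairs(records, threshold=0.7):
    """Keep only diptych pairs the rater passed with confidence at or
    above the threshold (0.7 matching the paper's 70 % example)."""
    return [r for r in records if r.decision == "pass" and r.confidence >= threshold]

records = [
    VerdictRecord("a", "pass", 0.92),  # kept
    VerdictRecord("b", "pass", 0.55),  # rejected: confidence too low
    VerdictRecord("c", "fail", 0.99),  # rejected: rater failed the pair
]
kept = filter_pairs(records)
```

Thresholding on the rater's confidence in addition to its pass/fail decision discards borderline verdicts, trading dataset size for reliability.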

The theoretical contribution is an analysis of how diptych pairs affect the DPO gradient. In standard DPO, the loss for a preference pair is

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(c,\,x^w,\,x^l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(x^w\mid c)}{\pi_{\mathrm{ref}}(x^w\mid c)} - \beta\log\frac{\pi_\theta(x^l\mid c)}{\pi_{\mathrm{ref}}(x^l\mid c)}\right)\right],$$

where $x^w$ and $x^l$ are the preferred and dispreferred images for prompt $c$, $\pi_\theta$ is the model being tuned, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic sigmoid, and $\beta$ controls how far $\pi_\theta$ may drift from the reference. Because diptych pairs differ only in the targeted region, the log-ratio difference (and hence the gradient) is driven by that region rather than by incidental variation in backgrounds, lighting, or composition.
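The generic DPO objective can be sketched numerically as below. Note this is the textbook (non-diffusion) form with all argument names assumed for illustration; diffusion variants such as Diffusion-DPO replace the exact image log-likelihoods with per-timestep denoising-error surrogates.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid of the beta-scaled implicit
    reward margin between winner and loser.

    logp_w / logp_l: model log-likelihoods of winning / losing image.
    ref_logp_w / ref_logp_l: frozen reference model's log-likelihoods.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With zero margin the loss is log(2); raising the winner's likelihood
# relative to the reference lowers the loss.
```

The gradient of this loss flows only through the margin term, which is why pairs that differ solely in the targeted region concentrate the learning signal there.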

