Alignment of Diffusion Model and Flow Matching for Text-to-Image Generation

Notice: This research summary and analysis were automatically generated using AI. For definitive details, please refer to the original arXiv source.

Diffusion models and flow matching have demonstrated remarkable success in text-to-image generation. Many existing alignment methods focus on fine-tuning pre-trained generative models to maximize a given reward function, but these approaches require extensive computational resources and may not generalize well across different objectives. In this work, we propose a novel alignment framework that leverages the underlying nature of the alignment problem, namely sampling from reward-weighted distributions, and show that it applies to both diffusion models (via score guidance) and flow matching models (via velocity guidance). The score function (velocity field) required for the reward-weighted distribution decomposes into the pre-trained score (velocity field) plus a term involving a conditional expectation of the reward. For alignment of diffusion models, we identify a fundamental challenge: the adversarial nature of the guidance term can introduce undesirable artifacts in the generated images. We therefore propose a fine-tuning-free framework that trains a guidance network to estimate the conditional expectation of the reward, achieving performance comparable to fine-tuning-based models with one-step generation while reducing computational cost by at least 60%. For alignment of flow matching models, we propose a training-free framework that improves generation quality without additional computational cost.


💡 Research Summary

The paper tackles the problem of aligning large text‑to‑image generative models with arbitrary reward functions without the heavy computational burden of fine‑tuning. It reframes alignment as sampling from a reward‑weighted distribution π_r(x|y) ∝ π_ref(x|y)·exp(r(x,y)/β), where π_ref is the pre‑trained model’s conditional distribution and r(x,y) is a reward that encodes human preferences, aesthetics, or any other objective. Under this formulation, the required new score (for diffusion) or velocity field (for flow‑matching) can be expressed as the sum of the pre‑trained score/velocity and the gradient of the log‑conditional expectation of the reward.
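The decomposition described above can be written out explicitly. The following is a sketch in standard notation (here x_t denotes the noisy sample at diffusion time t and x_0 the clean sample, with the expectation taken over the pre-trained model's denoising posterior; the same structure holds for the velocity field in flow matching):

```latex
% Reward-weighted target distribution
\pi_r(x \mid y) \;\propto\; \pi_{\mathrm{ref}}(x \mid y)\,
  \exp\!\big(r(x, y)/\beta\big)

% Aligned score = pre-trained score + reward-guidance term
\nabla_{x_t} \log \pi_{r,t}(x_t \mid y)
  \;=\; \nabla_{x_t} \log \pi_{\mathrm{ref},t}(x_t \mid y)
  \;+\; \nabla_{x_t} \log
    \mathbb{E}\!\left[\exp\!\big(r(x_0, y)/\beta\big)
      \,\middle|\, x_t\right]
```

The first term is available from the pre-trained model; only the second, reward-dependent term must be estimated, which is what makes fine-tuning-free alignment possible.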

For diffusion models, the authors identify a critical issue: the guidance term derived from the reward is adversarial in nature. Directly adding the gradient of the log-conditional expectation of the reward to the pre-trained score can introduce undesirable artifacts into the generated images. To address this, the paper proposes a fine-tuning-free framework that instead trains a guidance network to estimate this conditional expectation, which is then used to guide sampling at inference time.
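A minimal one-dimensional sketch of this guided-score idea, under stated assumptions: `score_ref` stands in for the pre-trained score network and `guidance_net` for the learned estimate of E[exp(r(x_0, y)/β) | x_t]; both are hypothetical toy functions, not the paper's actual networks, and the gradient of the log-guidance is taken by finite differences purely for illustration.

```python
import numpy as np

def score_ref(x_t):
    # Stand-in for a pre-trained score network
    # (score of a standard Gaussian toy model).
    return -x_t

def guidance_net(x_t):
    # Stand-in for the learned conditional expectation of
    # exp(r(x_0, y)/beta) given x_t; this toy "reward"
    # prefers samples near x = 1.0.
    return np.exp(-0.5 * (x_t - 1.0) ** 2)

def aligned_score(x_t, eps=1e-4):
    # Aligned score = pre-trained score + grad of log guidance.
    # The gradient is approximated by central finite differences;
    # in practice it would come from autodiff through the network.
    grad_log_h = (
        np.log(guidance_net(x_t + eps)) - np.log(guidance_net(x_t - eps))
    ) / (2.0 * eps)
    return score_ref(x_t) + grad_log_h

# At x_t = 0 the reference score is 0 and the guidance term
# pulls the sample toward the high-reward region at x = 1.
x = np.array([0.0])
s = aligned_score(x)  # approximately [1.0] for this toy setup
```

Because the guidance network is trained separately and only queried at sampling time, the pre-trained generative model itself is never updated.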

