Aligning Few-Step Diffusion Models with Dense Reward Difference Learning
Few-step diffusion models enable efficient high-resolution image synthesis but struggle to align with specific downstream objectives due to limitations of existing reinforcement learning (RL) methods in low-step regimes with limited state spaces and suboptimal sample quality. To address this, we propose Stepwise Diffusion Policy Optimization (SDPO), a novel RL framework tailored for few-step diffusion models. SDPO introduces a dual-state trajectory sampling mechanism, tracking both noisy and predicted clean states at each step to provide dense reward feedback and enable low-variance, mixed-step optimization. For further efficiency, we develop a latent similarity-based dense reward prediction strategy to minimize costly dense reward queries. Leveraging these dense rewards, SDPO optimizes a dense reward difference learning objective that enables more frequent and granular policy updates. Additional refinements, including stepwise advantage estimates, temporal importance weighting, and step-shuffled gradient updates, further enhance long-term dependency, low-step priority, and gradient stability. Our experiments demonstrate that SDPO consistently delivers superior reward-aligned results across diverse few-step settings and tasks. Code is available at https://github.com/ZiyiZhang27/sdpo.
💡 Research Summary
The paper tackles a pressing limitation of modern few‑step diffusion models: while they can synthesize high‑resolution images in as few as one to four denoising steps, they do not automatically align with downstream objectives such as aesthetic quality, user preference, or any reward‑based metric. Existing reinforcement‑learning (RL) fine‑tuning approaches (e.g., DDPO, DPO‑style methods) were designed for standard diffusion pipelines that run 20‑50 steps. When naively applied to the low‑step regime, these methods suffer from an extremely limited state space, poor sample quality, and sparse final‑step rewards, which together cause high variance, unstable training, and over‑fitting to longer trajectories.
To overcome these issues, the authors propose Stepwise Diffusion Policy Optimization (SDPO), a novel RL framework engineered specifically for few‑step diffusion. The core innovation is a dual‑state trajectory sampling mechanism: at each denoising step the model records both (i) the noisy latent xₜ and (ii) the predicted clean image x̂ₜ⁰ that the diffusion network outputs as an intermediate estimate of the final image. Because few‑step models are distilled to have strong single‑step denoising power, x̂ₜ⁰ remains a reliable proxy for image quality even at early steps. By applying the external reward function R to every x̂ₜ⁰, the authors obtain a dense reward signal r(sₜ, aₜ) = R(x̂ₜ⁰, c) at each step, eliminating the reliance on a single terminal reward.
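The dual-state rollout described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `denoise_step` is a hypothetical sampler interface assumed to return both the next noisy latent and the predicted clean image, and `reward_fn` stands in for an external reward model such as PickScore.

```python
import numpy as np

def dual_state_rollout(denoise_step, reward_fn, x_T, prompt, num_steps):
    """Roll out a few-step sampler while tracking both states per step.

    denoise_step(x_t, t) -> (x_{t-1}, x0_hat): one sampler step that also
    returns the model's predicted clean image (hypothetical interface).
    reward_fn(image, prompt) -> float: external reward model (stand-in).
    """
    x_t = x_T
    noisy_states, clean_preds, dense_rewards = [], [], []
    for t in reversed(range(num_steps)):
        x_t, x0_hat = denoise_step(x_t, t)
        noisy_states.append(x_t)
        clean_preds.append(x0_hat)
        # Dense reward: score the predicted clean image at *every* step,
        # rather than only the final sample.
        dense_rewards.append(reward_fn(x0_hat, prompt))
    return noisy_states, clean_preds, dense_rewards
```

Even for a 2-step sampler this yields two reward signals per trajectory instead of one terminal reward, which is what makes the per-step objective below possible.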
Querying the reward function at every step would be prohibitively expensive. SDPO therefore introduces a latent‑similarity‑based reward prediction scheme: it selects a small set of anchor steps (e.g., first, middle, last) and queries the true reward only at these points. For all other steps it interpolates rewards using cosine similarity between the latent representation of x̂ₜ⁰ and the anchor representations, assuming the reward varies smoothly (Lipschitz‑continuously) in latent space. This dramatically reduces the number of costly reward evaluations while preserving a smooth, high‑fidelity dense reward trajectory.
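A simplified sketch of this interpolation idea: query true rewards only at anchor steps, then fill in the remaining steps with a cosine-similarity-weighted combination of the anchor rewards. The softmax weighting here is an illustrative choice, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def predict_dense_rewards(latents, anchor_steps, anchor_rewards):
    """Predict per-step rewards from a few queried anchor steps.

    latents: list of flattened latent vectors for x0_hat at each step.
    anchor_steps: indices where the true reward model was queried.
    anchor_rewards: true rewards at those indices.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    rewards = np.empty(len(latents))
    for t, z in enumerate(latents):
        if t in anchor_steps:
            rewards[t] = anchor_rewards[anchor_steps.index(t)]
        else:
            # Weight each anchor reward by latent similarity to this step.
            sims = np.array([cos(z, latents[s]) for s in anchor_steps])
            w = np.exp(sims) / np.exp(sims).sum()  # softmax weights
            rewards[t] = float(w @ np.array(anchor_rewards))
    return rewards
```

With, say, 3 anchors over an 8-step trajectory, this cuts reward-model queries by more than half while still producing a reward value at every step.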
With dense rewards in hand, the authors formulate a Dense Reward Difference Learning objective. Instead of matching overall trajectory returns, SDPO aligns the stepwise reward differences Δ\hat{A}_t = \hat{R}_t − \hat{R}_{t−1} with the corresponding log‑likelihood ratio differences Δ\tilde{ρ}_t computed from the current policy and a reference policy. The loss for step t is
L_t = λ^{T−t−1}/η · (Δ\hat{A}_t − Δ\tilde{ρ}_t)²,
where λ is a temporal importance weight that emphasizes early steps (critical in few‑step regimes) and η scales the log‑ratio term. This formulation enables granular, per‑step policy updates, leading to faster convergence and more precise alignment with the reward.
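The per-step losses can be computed in a few lines. This is a sketch of the objective as stated above, with illustrative values for λ and η (the paper's actual hyperparameters and the exact placement of η are not specified in this summary):

```python
import numpy as np

def sdpo_step_losses(adv, logratio, lam=1.1, eta=1.0):
    """Per-step dense reward difference losses (sketch of the objective).

    adv: stepwise reward/advantage estimates R_hat_t, t = 0..T-1.
    logratio: per-step log-likelihood ratios rho_tilde_t between the
              current policy and a reference policy.
    lam, eta: temporal weight base and log-ratio scale (illustrative).
    """
    T = len(adv)
    d_adv = np.diff(np.asarray(adv))       # Delta A_hat_t = R_hat_t - R_hat_{t-1}
    d_rho = np.diff(np.asarray(logratio))  # Delta rho_tilde_t
    t = np.arange(1, T)
    # lam ** (T - t - 1) is largest at small t, emphasizing early steps.
    weights = lam ** (T - t - 1) / eta
    return weights * (d_adv - d_rho) ** 2
```

Note that the loss vanishes exactly when the policy's log-ratio differences track the reward differences, which is the alignment condition the objective encodes.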
To further stabilize training, SDPO incorporates three auxiliary techniques:
- Stepwise advantage estimation without discounting, preserving long‑term dependencies even when the horizon is only a few steps.
- Temporal importance weighting, which assigns larger weights to low‑step updates, ensuring the model does not neglect the early denoising decisions that dominate final image quality in few‑step settings.
- Step‑shuffled gradient updates, where the order of steps within a minibatch is randomly permuted before back‑propagation, reducing gradient correlation and variance.
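The third technique is the simplest to illustrate: instead of always updating steps in the fixed order t = T−1, …, 0, each gradient epoch visits the steps in a fresh random permutation. A minimal sketch (the batching details are assumptions, not the paper's code):

```python
import numpy as np

def step_shuffled_order(num_steps, num_epochs, seed=0):
    """Yield a freshly permuted step order for each gradient epoch.

    Visiting steps in a random order, rather than always back-to-front,
    decorrelates consecutive per-step gradient updates.
    """
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        yield rng.permutation(num_steps)
```

A training loop would then iterate `for order in step_shuffled_order(T, E): for t in order: update(t)`, applying one per-step loss at a time in shuffled order.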
The experimental suite evaluates SDPO on several few‑step diffusion backbones (1, 2, 4, and 8 steps) using the same training budget as the baseline DDPO. Rewards are measured with PickScore (a state‑of‑the‑art text‑to‑image preference model), and image quality is additionally assessed via CLIP‑Score and FID. Across all configurations, SDPO consistently outperforms DDPO, achieving 12%–25% higher PickScore and notable improvements in CLIP‑Score/FID, especially in the extreme 1‑step and 2‑step regimes where DDPO’s samples become noticeably blurry. Qualitative examples demonstrate that SDPO faithfully renders complex prompts (e.g., “cyberpunk cat wearing black leather jacket”) with sharp details, whereas DDPO either fails to capture the prompt or produces low‑fidelity outputs.
In summary, the paper makes four major contributions:
- A dual‑state sampling strategy that yields dense, low‑variance rewards for every denoising step.
- A latent‑similarity based reward prediction mechanism that dramatically cuts down reward query cost.
- A dense reward difference learning objective that enables per‑step policy refinement.
- An integrated SDPO framework that combines stepwise advantage, temporal importance weighting, and step‑shuffled updates for robust optimization in ultra‑low‑step diffusion.
By bridging the gap between few‑step diffusion efficiency and reward‑driven alignment, SDPO opens the door to practical, high‑quality, preference‑aware image generation with minimal computational overhead, a development poised to impact both research and commercial applications of generative AI.