Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Classifier-Free Guidance (CFG) is a fundamental technique for training conditional diffusion models. The common practice in CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, dropping the conditioning with a small probability. However, we observe that jointly learning the unconditional noise under this limited training bandwidth results in a poor unconditional prior. More importantly, these poor unconditional noise predictions significantly degrade the quality of conditional generation. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was trained from can be used for unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.


💡 Research Summary

Classifier‑Free Guidance (CFG) has become the de facto standard for conditioning diffusion models because it allows a single network to predict both conditional and unconditional noise, eliminating the need for an external classifier. In practice, many state‑of‑the‑art conditional generators (e.g., Zero‑1‑to‑3, Versatile Diffusion, InstructPix2Pix, DiT, DynamiCrafter) are obtained by fine‑tuning a large pretrained model such as Stable Diffusion. The authors observe a systematic degradation of the unconditional prior after fine‑tuning: unconditional samples from the fine‑tuned model lose detail and exhibit lower fidelity. More importantly, this poor unconditional estimate harms conditional generation, because CFG combines the two predictions as ϵ(γ) = ϵ∅ + γ·(ϵc − ϵ∅). A degraded ϵ∅ introduces errors into the estimated marginal p(xₜ), which propagate to the conditional distribution p(c|xₜ) and ultimately reduce image quality and text‑image alignment.
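The CFG combination above can be sketched in a few lines. This is a minimal NumPy illustration of the formula ϵ(γ) = ϵ∅ + γ·(ϵc − ϵ∅), not the authors' code; the function name is hypothetical.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by guidance scale gamma."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

# gamma = 1 recovers the purely conditional prediction;
# gamma = 0 recovers the purely unconditional prediction.
eps_u = np.zeros(4)
eps_c = np.ones(4)
print(cfg_noise(eps_u, eps_c, 1.0))  # equals eps_c
print(cfg_noise(eps_u, eps_c, 7.5))  # over-extrapolated, typical CFG scale
```

The formula makes explicit why a degraded ϵ∅ is so damaging: for γ > 1 the unconditional term is not merely subtracted once but actively extrapolated against, so its errors are amplified by the guidance scale.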

To address this, the paper proposes a remarkably simple, training‑free fix: keep the conditional prediction from the fine‑tuned model but replace its unconditional prediction with that of a strong unconditional model (the “base” model). Formally, the new guided noise becomes ϵθ,ψ(γ) = ϵψ(xₜ, ∅) + γ·(ϵθ(xₜ, c) − ϵψ(xₜ, ∅)), where ϵψ is the unconditional noise from the base model ψ and ϵθ is the conditional noise from the fine‑tuned model θ. During DDIM sampling the same algorithmic steps are used; only the source of the unconditional term changes.
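The swap can be made concrete with a toy sampler fragment. The sketch below, under the assumption that each model exposes a noise-prediction callable, pairs the modified guidance with a standard deterministic DDIM update (η = 0); the function names and the dummy models are hypothetical, not the paper's implementation.

```python
import numpy as np

def guided_eps(x_t, c, eps_finetuned, eps_base, gamma):
    """Proposed replacement: conditional term from the fine-tuned model
    theta, unconditional term from the base model psi."""
    eps_uncond = eps_base(x_t, None)    # eps_psi(x_t, ∅)
    eps_cond = eps_finetuned(x_t, c)    # eps_theta(x_t, c)
    return eps_uncond + gamma * (eps_cond - eps_uncond)

def ddim_step(x_t, eps, alpha_t, alpha_prev):
    """One deterministic DDIM update (eta = 0). The update rule itself is
    unchanged by the method -- only the source of eps differs."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
    return np.sqrt(alpha_prev) * x0_pred + np.sqrt(1.0 - alpha_prev) * eps

# Toy stand-ins for the two networks, for illustration only.
base_model = lambda x_t, c: np.zeros_like(x_t)       # psi: unconditional
finetuned_model = lambda x_t, c: np.ones_like(x_t)   # theta: conditional
x_t = np.ones(3)
eps = guided_eps(x_t, "some condition", finetuned_model, base_model, gamma=2.0)
x_prev = ddim_step(x_t, eps, alpha_t=0.9, alpha_prev=0.95)
```

Because the sampler loop is untouched, the method adds no training and no parameters; the only cost is one extra forward pass through the base model per step (the same cost CFG already pays for its unconditional branch).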

The authors conduct extensive experiments across a wide spectrum of conditional diffusion models, covering text‑to‑image, image‑to‑image, text‑guided editing, and video synthesis. Quantitatively, the unconditional‑replacement strategy consistently lowers FID scores by roughly 10‑20 % and improves CLIP‑Score or other alignment metrics. Qualitatively, generated samples show sharper details, more accurate geometry, and better adherence to the conditioning signal. Notably, the base model does not have to be the exact one used for fine‑tuning; swapping in a different high‑quality unconditional model such as Stable Diffusion 2.1 or PixArt‑α yields further gains, demonstrating that the method is robust to architectural differences.

The paper situates its contribution relative to prior work on guidance (e.g., Autoguidance) and model merging (Diffusion Soup, Mix‑of‑Show, Max‑Fusion). Unlike those approaches, which either modify the guidance signal within a single model or blend weights, this method merges predictions from two models with different conditioning modalities, directly correcting the unconditional prior without any additional training or parameter overhead.

In summary, the key insights are: (1) fine‑tuning with CFG inevitably harms the unconditional prior because the network’s capacity is split between conditional and unconditional tasks and the unconditional dropout rate is low; (2) the quality of the unconditional prior is a critical factor for the overall performance of CFG‑based conditional generation; (3) simply reusing a strong unconditional prior from a base or any other pretrained diffusion model restores the marginal distribution and dramatically improves conditional generation. This finding suggests that future pipelines for adapting large diffusion models should treat the unconditional prior as a separate, reusable component rather than forcing the fine‑tuned network to relearn it.

