Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss
Recent studies have explored autoregressive models for image generation with promising results, and have combined diffusion models with autoregressive frameworks to optimize image generation via a diffusion loss. In this study, we present a theoretical comparison of conditional diffusion models and autoregressive models trained with a diffusion loss, highlighting the latter's advantages: patch-wise denoising optimization in autoregressive models effectively mitigates condition errors and yields a stable condition distribution. Our analysis further reveals that autoregressive condition generation refines the condition, causing the influence of condition errors to decay exponentially. In addition, we introduce a novel condition-refinement approach based on Optimal Transport (OT) theory to address "condition inconsistency". We theoretically show that formulating condition refinement as a Wasserstein Gradient Flow ensures convergence toward the ideal condition distribution, effectively mitigating condition inconsistency. Experiments demonstrate the superiority of our method over both diffusion models and autoregressive models with diffusion loss.
💡 Research Summary
The paper investigates the interplay between diffusion models and autoregressive (AR) models when both are trained with a diffusion‑based loss, and it proposes a principled method to mitigate “condition errors” that arise during conditional image generation.
First, the authors contrast traditional conditional diffusion, which uses a fixed global condition c throughout the denoising trajectory, with AR‑diffusion loss models, where the condition evolves as a sequence c₁, c₂, … that depends on previously generated conditions. Under the standard diffusion assumptions (Markovian forward process, Gaussian transitions, small variance) they formalize this dynamic conditioning and show that the conditional score‑matching loss upper‑bounds the unconditional one (Theorem 1). By expanding both losses (Lemma 1) they isolate a term f(cᵢ)=‖∇ₓₜ log pₜ(xₜ|cᵢ)‖² and prove (Lemma 2) that the expected difference between this term and the unconditional score norm equals the squared norm of the classifier‑free guidance term σₜ²∇ₓₜ log p(c|xₜ). This reveals that the extra guidance introduced by conditioning can be directly measured and, more importantly, that it can be reduced through iterative refinement.
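The decomposition behind Lemma 2 can be written out explicitly. By Bayes' rule the conditional score splits into the unconditional score plus a guidance term, and (per the summary above) the cross term vanishes in expectation under the paper's assumptions, leaving exactly the squared guidance norm:

$$
\nabla_{x_t} \log p_t(x_t \mid c_i) \;=\; \nabla_{x_t} \log p_t(x_t) \;+\; \nabla_{x_t} \log p(c_i \mid x_t),
$$
$$
\mathbb{E}\big[\, f(c_i) - \|\nabla_{x_t} \log p_t(x_t)\|^2 \,\big]
\;=\; \mathbb{E}\big[\, \|\sigma_t^2 \nabla_{x_t} \log p(c_i \mid x_t)\|^2 \,\big],
$$

so the extra guidance introduced by conditioning is measurable, and shrinking $\nabla_{x_t} \log p(c_i \mid x_t)$ across refinement steps directly shrinks the gap between the conditional and unconditional losses.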
The core theoretical contribution is the analysis of patch‑wise denoising in an AR setting. The model updates the condition after each generated patch via c_{i+1} = T(c_i). Assuming the update follows a linear autoregressive recursion c_{i+1} = ∑_{j=0}^{p} a_j c_{i−j} + ε_{i+1} with bounded coefficients (|a_j| < 1) and i.i.d. Gaussian noise, the sequence {c_i} forms a strong Markov chain (Lemma 3). Under these conditions the gradient norm of the conditional score decays exponentially with the iteration index, i.e., the influence of the condition on the final image diminishes as e^{−λi} for some λ > 0. Proposition 1 formalizes that each patch‑denoising step reduces the condition error, leading to a stable stationary distribution of conditions.
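The exponential decay is easy to see numerically in the simplest case of the recursion, an AR(1) chain. The sketch below (our illustration, not the paper's implementation; the coefficient `a` and noise scale are hypothetical) runs two chains that differ only in their initial condition and shares the noise sequence between them, so their gap isolates the initial condition's influence, which shrinks as |a|^i.

```python
import numpy as np

# Minimal AR(1) sketch of the condition update c_{i+1} = a*c_i + eps
# with |a| < 1; "a" and "sigma_eps" are illustrative choices.
a = 0.7          # hypothetical AR coefficient, |a| < 1
n_steps = 30
sigma_eps = 0.1  # scale of the i.i.d. Gaussian innovations

def run_chain(c0: float) -> np.ndarray:
    """Run the AR(1) condition chain from initial condition c0,
    reusing a fixed noise sequence so two runs are comparable."""
    noise = np.random.default_rng(1).normal(0.0, sigma_eps, n_steps)
    c = np.empty(n_steps + 1)
    c[0] = c0
    for i in range(n_steps):
        c[i + 1] = a * c[i] + noise[i]
    return c

# Two chains differing only in the initial condition: because the
# shared noise cancels, gap[i] = |a|**i * gap[0] exactly.
gap = np.abs(run_chain(5.0) - run_chain(-5.0))
print(gap[0], gap[10], gap[20])  # 10.0, ~0.28, ~0.008
```

The same geometric contraction carries over to the AR(p) recursion as long as the coefficients keep the chain stable, which is the mechanism behind the e^{−λi} decay claimed above.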
Despite this theoretical advantage, the authors identify a practical problem they call “condition inconsistency”: errors and extraneous information accumulate in the condition vector as patches are generated, eventually degrading later patches. To address this, they introduce an Optimal Transport (OT) based refinement. They model the condition distribution μ_t as evolving under a Wasserstein Gradient Flow: ∂_t μ_t = ∇·(μ_t∇Φ), where Φ is a potential learned by a refinement network T. Theorem 4 proves that this flow monotonically decreases the 2‑Wasserstein distance W₂(μ_t, ν) to an ideal condition distribution ν, guaranteeing convergence and thus eliminating condition inconsistency.
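A Wasserstein Gradient Flow of this form can be simulated with particles: each sample of the condition distribution moves down the potential, x ← x − η∇Φ(x). The sketch below is a toy illustration (not the paper's refinement network T): we pick a quadratic potential Φ(x) = ½(x − m)², whose flow contracts the particle cloud toward the ideal point m, so the 2‑Wasserstein distance to ν = δ_m decreases monotonically, mirroring the guarantee of Theorem 4.

```python
import numpy as np

# Toy particle simulation of the flow ∂_t μ_t = ∇·(μ_t ∇Φ) with a
# quadratic potential; m, eta, and the initial cloud are illustrative.
rng = np.random.default_rng(0)
m = 0.0                                      # ideal condition (target ν = δ_m)
particles = rng.normal(3.0, 1.0, size=500)   # initial condition samples
eta = 0.1                                    # discretization step size

def w2_to_point(x: np.ndarray, m: float) -> float:
    """2-Wasserstein distance from an empirical measure to the Dirac δ_m."""
    return float(np.sqrt(np.mean((x - m) ** 2)))

dists = [w2_to_point(particles, m)]
for _ in range(50):
    particles = particles - eta * (particles - m)  # x ← x − η∇Φ(x)
    dists.append(w2_to_point(particles, m))

# Each step rescales every displacement by (1 − η), so the distance
# contracts geometrically toward zero.
print(dists[0], dists[-1])
```

In the paper the potential Φ is learned rather than fixed, but the monotone decrease of W₂(μ_t, ν) along the flow is the property that rules out accumulating condition inconsistency.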
Empirically, the method is evaluated on ImageNet‑1k, comparing against (i) standard conditional diffusion (Diffusion‑C), (ii) VQ‑VAE based AR models, and (iii) recent AR‑diffusion‑loss baselines. The proposed OT‑refined AR‑diffusion model achieves FID ≈ 6.8 and Inception Score ≈ 210, outperforming all baselines. Notably, at 256×256 resolution the condition‑consistency metric improves by over 30 %, and visualizations of the denoising trajectory show rapid alignment of the condition distribution after each refinement step.
In summary, the paper makes three major contributions: (1) a rigorous theoretical comparison showing that patch‑wise denoising in AR models with diffusion loss inherently reduces condition errors; (2) a proof that the influence of the condition decays exponentially across autoregressive iterations, leading to a stable condition distribution; and (3) an Optimal‑Transport‑based Wasserstein Gradient Flow for condition refinement that provably converges to the ideal condition distribution, effectively solving condition inconsistency. These insights advance the stability and quality of conditional image generation, offering a VQ‑VAE‑free pathway for high‑fidelity, multimodal generative models.