Provably Reliable Classifier Guidance via Cross-Entropy Control
Classifier-guided diffusion models generate conditional samples by augmenting the reverse-time score with the gradient of the log-probability predicted by a probabilistic classifier. In practice, this classifier is usually obtained by minimizing an empirical loss function. While existing statistical theory guarantees good generalization performance when the sample size is sufficiently large, it remains unclear whether such training yields an effective guidance mechanism. We study this question in the context of cross-entropy loss, which is widely used for classifier training. Under mild smoothness assumptions on the classifier, we show that controlling the cross-entropy at each diffusion model step is sufficient to control the corresponding guidance error. In particular, probabilistic classifiers achieving conditional KL divergence $\varepsilon^2$ induce guidance vectors with mean squared error $\widetilde O(d \varepsilon )$, up to constant and logarithmic factors. Our result yields an upper bound on the sampling error of classifier-guided diffusion models and bears resemblance to a reverse log-Sobolev–type inequality. To the best of our knowledge, this is the first result that quantitatively links classifier training to guidance alignment in diffusion models, providing both a theoretical explanation for the empirical success of classifier guidance, and principled guidelines for selecting classifiers that induce effective guidance.
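As background for the summary below, classifier guidance augments the learned score with the classifier's log-probability gradient, $\nabla_x \log p_t(x) + \nabla_x \log \hat p_t(y\mid x)$. A minimal NumPy sketch of this combination (the `score_fn` and `log_classifier` callables are hypothetical stand-ins for the learned score network and classifier, not anything from the paper; the classifier gradient is taken by central finite differences):

```python
import numpy as np

def guided_score(x, y, score_fn, log_classifier, scale=1.0, eps=1e-4):
    """Classifier-guided score: score_fn(x) + scale * grad_x log p(y|x).

    x is a 1-D float array; the classifier gradient is approximated by
    central finite differences, which is exact for quadratic log-probs.
    """
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (log_classifier(x + e, y) - log_classifier(x - e, y)) / (2 * eps)
    return score_fn(x) + scale * grad
```

For a Gaussian toy model where both gradients are known in closed form, this recovers the exact guided score up to the finite-difference error.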
💡 Research Summary
The paper “Provably Reliable Classifier Guidance via Cross‑Entropy Control” addresses a fundamental theoretical gap in classifier‑guided diffusion models: while classifiers are typically trained by minimizing cross‑entropy (or conditional KL) loss, it has been unclear whether a low KL guarantees that the gradient of the log‑predicted class probabilities, which is used as a guidance vector, aligns well with the true gradient needed for conditional sampling.
The authors first construct a negative example showing that without additional regularity, a small conditional KL does not imply accurate guidance. By adding high‑frequency perturbations to the true conditional label distribution, they build a sequence of classifiers whose KL divergence to the true conditionals vanishes at rate $O(1/\sqrt{n})$, yet the mean‑squared error (MSE) of the guidance vectors grows as $\Omega(n)$. Even with perturbation amplitudes scaled as $\Theta(1/n)$, the KL can vanish faster ($O(1/n)$) while the guidance MSE remains bounded away from zero. This demonstrates that KL, an $L^1$‑type global divergence, is insufficient to control the $L^2$‑type local error of the gradient field.
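A one-dimensional toy version of this failure mode (an illustration in the spirit of the construction, not the paper's exact example): perturb a flat conditional $p(y{=}1\mid x)=1/2$ by $a\sin(kx)$ with amplitude $a=1/\sqrt{k}$. The pointwise Bernoulli KL shrinks like $a^2$, while the squared gradient of the log-conditional grows like $a^2k^2$:

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL(Ber(p) || Ber(q)), pointwise."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

xs = np.linspace(0.0, 2.0 * np.pi, 200_000, endpoint=False)
kls, grad_mses = [], []
for k in (10, 100, 1000):              # perturbation frequency
    a = 1.0 / np.sqrt(k)               # amplitude shrinks as frequency grows
    q = 0.5 + a * np.sin(k * xs)       # perturbed conditional, stays in (0, 1)
    kls.append(bernoulli_kl(0.5, q).mean())
    # d/dx log q(x); the true conditional is flat, so the true gradient is 0
    grad_log_q = a * k * np.cos(k * xs) / q
    grad_mses.append(np.mean(grad_log_q ** 2))

print(kls)        # decreasing toward 0
print(grad_mses)  # increasing without bound
```

The average KL vanishes as $k\to\infty$ while the guidance MSE diverges, matching the qualitative message of the negative example.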
The core positive contribution is a set of sufficient conditions under which a small KL does guarantee accurate guidance. Assuming (i) the data distribution has bounded support, and (ii) the classifier $\hat p_t(y|x)$ satisfies the same smoothness (e.g., $C^k$ with $k\ge2$ or Sobolev $H^k$) as the true conditional $p_t(y|x)$, the authors prove that a conditional KL of $O(\varepsilon^2)$ yields a guidance MSE of $\widetilde O(d\varepsilon)$, where $d$ is the data dimension and $\widetilde O$ hides poly‑logarithmic factors. The proof proceeds by expanding the KL in terms of the log‑ratio, applying a Taylor series to the logarithm, and using Poincaré‑type inequalities for smooth functions to bound the gradient discrepancy. The resulting inequality resembles a reverse log‑Sobolev inequality: the KL controls the Fisher‑information‑like term $\|\nabla_x\log p_t(y|x) - \nabla_x\log\hat p_t(y|x)\|_2^2$. The authors also show that the $\widetilde O(d\varepsilon)$ dependence on $\varepsilon$ is tight by constructing an explicit example that attains this rate.
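Schematically (symbols as above; the exact constants, smoothness conditions, and logarithmic factors are spelled out in the paper), the positive result has the shape

```latex
% Guidance MSE is controlled by the square root of the conditional KL,
% with linear dependence on the dimension d:
\mathbb{E}_{x \sim p_t}\!\left[
  \big\| \nabla_x \log \hat p_t(y \mid x) - \nabla_x \log p_t(y \mid x) \big\|_2^2
\right]
\;\le\; \widetilde{O}\!\left( d\, \varepsilon \right),
\qquad \text{where } \mathrm{KL}\big(p_t(\cdot \mid x) \,\|\, \hat p_t(\cdot \mid x)\big) \le \varepsilon^2 .
```

A standard log-Sobolev inequality bounds a KL divergence by a Fisher-information term; here the direction is reversed, with the KL (up to a square root) controlling the gradient discrepancy, which is why the authors describe it as a reverse log-Sobolev-type inequality.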
Leveraging this result, they analyze the sampling error of a DDPM equipped with classifier guidance. If at each diffusion time step $t$ the classifier satisfies the smoothness assumptions, then the KL divergence between the model’s output distribution and the true data distribution scales as $\widetilde O(d\,\varepsilon_{\text{avg}})$, where $\varepsilon_{\text{avg}}$ is the average conditional KL across time steps (ignoring score‑estimation and discretization errors).
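Where the guidance term enters the sampler can be seen in a schematic ancestral-sampling loop (a NumPy sketch under simplifying assumptions; `score_fn` and `guidance_fn` are hypothetical stand-ins for the learned score network and the per-step classifier gradient, not the paper's implementation):

```python
import numpy as np

def guided_ddpm_sample(score_fn, guidance_fn, betas, d, rng, scale=1.0):
    """One classifier-guided DDPM ancestral-sampling pass (schematic).

    score_fn(x, t)    -> estimate of grad_x log p_t(x)
    guidance_fn(x, t) -> estimate of grad_x log p_t(y | x) for the target class
    """
    x = rng.standard_normal(d)                            # start from N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        beta = betas[t]
        s = score_fn(x, t) + scale * guidance_fn(x, t)    # guided score
        mean = (x + beta * s) / np.sqrt(1.0 - beta)       # DDPM reverse mean
        noise = rng.standard_normal(d) if t > 0 else 0.0  # no noise at t = 0
        x = mean + np.sqrt(beta) * noise
    return x
```

The error bound above concerns exactly the `guidance_fn` term: each step's guidance inaccuracy, controlled via the conditional KL, accumulates into the $\widetilde O(d\,\varepsilon_{\text{avg}})$ sampling error.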
Empirically, the authors train two families of classifiers on CIFAR‑10 and ImageNet‑64: a standard ResNet trained solely with cross‑entropy, and a “smooth” variant that includes higher‑order derivative regularization. They evaluate (a) cosine similarity between true and estimated guidance vectors, (b) $L^2$ guidance error, and (c) downstream generation quality measured by FID. The smooth classifiers consistently achieve lower guidance error (≈30 % reduction) and better FID scores (5–7 points improvement), confirming the theoretical predictions. Additional experiments injecting synthetic high‑frequency noise replicate the negative construction, showing severe degradation of guidance direction.
In summary, the paper makes three key points: (1) small conditional KL alone is insufficient for reliable classifier guidance; (2) imposing smoothness on the classifier bridges the gap, yielding a provable bound on guidance MSE that scales linearly with dimension and the square‑root of KL; (3) this bound translates into concrete sampling error guarantees for guided diffusion models. The work connects diffusion‑model theory with functional‑inequality literature (reverse log‑Sobolev inequalities) and provides practical guidance for designing classifiers that are not only accurate in classification but also suitable for gradient‑based guidance. Future directions include extending the analysis to non‑Gaussian forward processes, handling continuous or multi‑label conditioning, and developing efficient optimization schemes that enforce the required smoothness at scale.