Learning To Sample From Diffusion Models Via Inverse Reinforcement Learning
Diffusion models generate samples through an iterative denoising process, guided by a neural network. While training the denoiser on real-world data is computationally demanding, the sampling procedure itself is more flexible. This adaptability serves as a key lever in practice, enabling improvements in both the quality of generated samples and the efficiency of the sampling process. In this work, we introduce an inverse reinforcement learning framework for learning sampling strategies without retraining the denoiser. We formulate the diffusion sampling procedure as a discrete-time finite-horizon Markov Decision Process, where actions correspond to optional modifications of the sampling dynamics. To optimize action scheduling, we avoid defining an explicit reward function. Instead, we directly match the target behavior expected from the sampler using policy gradient techniques. We provide experimental evidence that this approach can improve the quality of samples generated by pretrained diffusion models and automatically tune sampling hyperparameters.
💡 Research Summary
The paper proposes a novel inverse reinforcement learning (IRL) framework for automatically optimizing the sampling hyper‑parameters of pretrained diffusion models without retraining the denoiser network. Diffusion generation proceeds by iteratively denoising a Gaussian sample according to a learned score network; however, the quality and efficiency of the final samples are heavily influenced by heuristic choices such as stochasticity injection, classifier‑free guidance, and restart (renoise) strategies. Existing methods either manually tune these knobs or define explicit reward functions for reinforcement learning, which can be brittle and dataset‑specific.
The authors cast the entire sampling process as a finite‑horizon Markov Decision Process (MDP). A state is $s_t = (x_t, \sigma_t)$, where $x_t$ is the current image (or data vector) and $\sigma_t$ is the noise level at step $t$. The action space is discrete and encodes decisions such as which guidance scale $\omega$ to use, how much stochastic noise $\gamma$ to inject, or whether to trigger a restart and to which higher noise level. Transition dynamics are given by the standard diffusion update operator $H$, modified according to the selected action; sampling is initialized at the highest noise level $\sigma_N$, and the lowest level $\sigma_0$ is treated as an absorbing state.
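The MDP view above can be sketched in code. The following is a minimal, hypothetical illustration: the toy `denoiser`, the `sampler_step` update, and the `(gamma, restart_sigma)` action encoding are all assumptions made here for exposition, not the paper's implementation.

```python
import numpy as np

def denoiser(x, sigma):
    """Toy stand-in for a pretrained denoiser D(x, sigma).

    For data distributed as N(0, I), the optimal denoiser is
    x / (1 + sigma^2); we use it so the sketch is self-contained.
    """
    return x / (1.0 + sigma**2)

def sampler_step(x, sigma, sigma_next, action, rng):
    """One MDP transition s_t = (x_t, sigma_t) -> s_{t-1}.

    action = (gamma, restart_sigma): gamma scales stochasticity
    injection; restart_sigma, if not None, renoises back to a higher
    noise level. Both are assumed encodings for illustration.
    """
    gamma, restart_sigma = action
    if restart_sigma is not None and restart_sigma > sigma:
        # Restart: jump back to a higher noise level by adding the
        # noise needed to bridge sigma -> restart_sigma.
        extra = np.sqrt(restart_sigma**2 - sigma**2)
        return x + extra * rng.standard_normal(x.shape), restart_sigma
    # Deterministic (Euler-style) part of the update toward sigma_next.
    d = (x - denoiser(x, sigma)) / sigma  # score-based direction
    x_next = x + (sigma_next - sigma) * d
    # Optional stochasticity injection controlled by the action.
    if gamma > 0 and sigma_next > 0:
        x_next = x_next + gamma * sigma_next * rng.standard_normal(x.shape)
    return x_next, sigma_next
```

A rollout then repeatedly calls `sampler_step` along a decreasing noise schedule, with the policy choosing an action at every level; a restart action moves the state back up the schedule instead of down.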
Crucially, the method does not require an expert policy or expert actions. Instead, it leverages “expert state distributions”: real data samples and their noisy counterparts at selected noise levels. This state‑only supervision aligns with state‑marginal IRL, where the objective is to match the occupancy measure $\mu^\theta$ induced by the policy $\pi_\theta$ to the expert occupancy $\mu^E$. The authors minimize an $f$-divergence $D_f(\mu^E \,\|\, \mu^\theta)$ directly, avoiding any reward-learning step. They discuss several divergences (KL, reverse KL, total variation) and derive explicit decompositions for the KL and reverse KL cases:
$$D_f\!\left(\mu^E \,\middle\|\, \mu^\theta\right) = \mathbb{E}_{s \sim \mu^\theta}\!\left[f\!\left(\frac{\mu^E(s)}{\mu^\theta(s)}\right)\right],$$

where $f(u) = u \log u$ recovers the forward KL and $f(u) = -\log u$ the reverse KL.
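As a small numerical illustration of this objective (not the paper's estimator, which must work from samples rather than explicit densities), the $f$-divergence can be computed directly for discrete distributions, with forward and reverse KL as the two instances named above:

```python
import numpy as np

def f_divergence(mu_E, mu_theta, f):
    """D_f(mu_E || mu_theta) = E_{s ~ mu_theta}[ f(mu_E(s) / mu_theta(s)) ]
    for discrete distributions given as probability vectors with full support.
    """
    ratio = mu_E / mu_theta
    return float(np.sum(mu_theta * f(ratio)))

# f(u) = u * log(u)  -> forward KL:  KL(mu_E || mu_theta)
kl = lambda p, q: f_divergence(p, q, lambda u: u * np.log(u))

# f(u) = -log(u)     -> reverse KL:  KL(mu_theta || mu_E)
rkl = lambda p, q: f_divergence(p, q, lambda u: -np.log(u))
```

Both divergences vanish exactly when the sampler's occupancy matches the expert's, which is the sense in which minimizing them "matches the target behavior" without an explicit reward.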