Dichotomous Diffusion Policy Optimization


Diffusion-based policies have gained growing popularity for a wide range of decision-making tasks due to their superior expressiveness and controllable generation at inference time. However, effectively training large diffusion policies with reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues because they rely on crude Gaussian likelihood approximations, which require a large number of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted-regression objective for diffusion policy extraction but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under this design, optimized actions can be generated by linearly combining the scores of the dichotomous policies during inference, thereby enabling flexible control over the level of greediness. Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.


💡 Research Summary

The paper introduces DIPOLE (Dichotomous diffusion Policy improvement), a novel reinforcement‑learning algorithm designed to train large diffusion‑based policies in a stable and controllable manner. Diffusion models have become popular for policy representation because they can capture multimodal action distributions and allow controllable generation at inference time. However, existing approaches to fine‑tune diffusion policies with RL suffer from two major drawbacks. First, directly back‑propagating reward or value gradients through the multi‑step denoising process leads to noisy, unstable updates and incurs prohibitive computational cost. Second, methods that approximate the likelihood of intermediate diffusion steps with crude Gaussian assumptions require a very large number of small denoising steps, making training inefficient and prone to approximation error accumulation.

To address these issues, the authors revisit the KL‑regularized RL objective, which balances a value term G(s,a) against a KL divergence penalty that keeps the learned policy close to a reference policy µ. The closed‑form solution of the standard KL‑regularized problem is π∗(a|s) ∝ µ(a|s)·exp(β·G(s,a)). While a large inverse temperature β yields a greedy policy, the exponential weight grows explosively, causing loss divergence and instability. Moreover, the loss becomes dominated by a few high‑return samples, reducing data efficiency.
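To make the data-efficiency point concrete, here is a small synthetic illustration (not from the paper): as β grows, the effective sample size of the normalized exponential weights collapses, so a handful of high-G samples dominate the weighted regression loss.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=1000)  # synthetic stand-in for advantage values G(s, a)

def ess(w):
    """Effective sample size of normalized weights: 1 / sum(w_i^2)."""
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

for beta in (0.1, 1.0, 10.0):
    w_exp = np.exp(beta * G)  # unbounded exponential weight exp(beta * G)
    print(f"beta={beta:5.1f}  max weight={w_exp.max():.3e}  ESS={ess(w_exp):.1f}")
```

With β = 0.1 nearly all 1000 samples contribute; with β = 10 the effective sample size drops to a few samples, which is exactly the instability the sigmoid reweighting below is designed to avoid.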

DIPOLE mitigates these problems by (1) replacing the unbounded exponential weight with a bounded sigmoid σ(β·G) and (2) introducing an explicit “greediness factor” ω that scales the exponential component separately. The resulting optimal policy is π∗(a|s) ∝ µ(a|s)·σ(β·G)·exp(ω·β·G). By exploiting the identity exp(x) = σ(x)/(1−σ(x)), the authors show that this expression can be factorized in terms of two weighted reference policies: a positive policy π⁺(a|s) ∝ µ(a|s)·σ(β·G) that emphasizes high‑reward actions, and a negative policy π⁻(a|s) ∝ µ(a|s)·(1−σ(β·G)) that emphasizes low‑reward actions, with the exponential term absorbed into the ratio (π⁺/π⁻)^ω. Because both policies use sigmoid weights bounded in (0, 1), the risk of loss explosion is eliminated.
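The factorization rests on a one-line property of the logistic sigmoid. A minimal numeric check of that identity, plus the boundedness of the two dichotomous weights (β and G values here are arbitrary illustrations):

```python
import numpy as np

# The identity behind the factorization: exp(x) = sigma(x) / (1 - sigma(x)),
# so the exponential term can be rewritten as a ratio of sigmoid weights.
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-5.0, 5.0, 101)
assert np.allclose(np.exp(x), sigma(x) / (1.0 - sigma(x)))

# Dichotomous weights: both lie in (0, 1), so neither regression loss can explode.
beta = 3.0
G = np.array([-2.0, 0.0, 2.0])     # synthetic advantage values
w_pos = sigma(beta * G)            # weight for the reward-maximizing policy pi+
w_neg = 1.0 - sigma(beta * G)      # weight for the reward-minimizing policy pi-
assert (w_pos < 1.0).all() and (w_neg < 1.0).all()
print("weights pi+:", np.round(w_pos, 3), " pi-:", np.round(w_neg, 3))
```

Note the contrast with the exponential weight exp(β·G), which is unbounded above: the sigmoid saturates at 1 for very good actions instead of diverging.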

Training proceeds by learning two independent diffusion models (or flow‑matching models), one for π⁺ and one for π⁻, using the standard diffusion regression loss multiplied by the respective sigmoid weight. This approach requires only a simple modification to the base diffusion loss and does not involve any costly likelihood estimation or multi‑step back‑propagation. At inference time, actions are generated by linearly combining the scores of the two models according to the greediness factor ω, exactly mirroring the classifier‑free guidance technique widely used in diffusion sampling. Adjusting ω provides a smooth control knob that interpolates between conservative (low ω) and highly greedy (high ω) behavior.
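A minimal sketch of the inference-time combination, assuming the classifier-free-guidance-style form (1+ω)·∇log π⁺ − ω·∇log π⁻ (my reading of “linearly combining the scores”; the toy Gaussian score functions below are illustrative stand-ins, not the paper’s trained models):

```python
import numpy as np

def guided_score(score_pos, score_neg, a_t, t, omega):
    """CFG-style combination of the dichotomous scores.
    omega = 0 recovers the conservative positive policy; larger omega is greedier."""
    return (1.0 + omega) * score_pos(a_t, t) - omega * score_neg(a_t, t)

# Toy score models: pi+ pulls actions toward +1 (high reward), pi- toward -1.
score_pos = lambda a, t: -(a - 1.0)
score_neg = lambda a, t: -(a + 1.0)

a = np.zeros(4)  # batch of actions at the current denoising step
means = []
for omega in (0.0, 1.0, 4.0):
    # One Euler step of score ascent as a stand-in for a denoising update.
    step = a + 0.5 * guided_score(score_pos, score_neg, a, t=0.0, omega=omega)
    means.append(step.mean())
    print(f"omega={omega}: mean action after one step = {step.mean():+.2f}")
```

Increasing ω pushes samples harder toward the high-reward mode and away from the low-reward one, which is the “smooth control knob” behavior described above.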

Empirically, DIPOLE is evaluated on both offline RL (ExORL) and offline‑to‑online RL (OGBench) benchmarks. Across a variety of locomotion and manipulation tasks, DIPOLE consistently outperforms prior diffusion‑based RL methods, achieving faster convergence and higher final returns. The authors also scale the method to a large vision‑language‑action (VLA) model and test it on NAVSIM, a real‑world autonomous driving benchmark. The DIPOLE‑trained VLA model shows significant performance gains over a strong pre‑trained baseline, demonstrating the method’s applicability to complex, high‑dimensional real‑world decision‑making problems.

In summary, DIPOLE provides a theoretically grounded yet practically simple solution to the instability and inefficiency of diffusion policy RL. By decomposing the optimal policy into two bounded, dichotomous diffusion policies—one for reward maximization and one for reward minimization—and by leveraging a controllable greediness factor, DIPOLE achieves stable training, efficient use of both good and bad data, and flexible policy control, marking a substantial advance for diffusion‑based reinforcement learning.

