Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation
Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, existing applications to decision-making either struggle with suboptimal demonstrations under behavior cloning or rely on complex concurrent training of multiple networks under the actor-critic framework. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method achieves single-step sampling while generating higher-reward action trajectories through decoupled training and noise-free reward signals. Empirical evaluations on the Gym MuJoCo, FrankaKitchen, and long-horizon planning benchmarks demonstrate that our approach can achieve a 9.7% improvement over previous state-of-the-art while offering up to 142× speedup in inference time over diffusion counterparts.
💡 Research Summary
This paper tackles the long‑standing bottleneck of diffusion‑based planners in offline reinforcement learning (RL): the need for many denoising steps during inference, which makes real‑time deployment impractical. While consistency models have shown that a multi‑step diffusion process can be distilled into a single‑step “student” model, prior attempts to bring this idea to decision‑making suffer from two fundamental issues. First, behavior‑cloning pipelines work well only when the offline dataset consists of expert demonstrations; with sub‑optimal or heterogeneous data they indiscriminately learn all modes, including low‑reward ones. Second, methods that embed diffusion or consistency models inside actor‑critic frameworks require concurrent training of multiple networks (actor, critic, diffusion teacher) and delicate hyper‑parameter balancing, leading to instability and high computational cost.
The authors propose Reward‑Aware Consistency Trajectory Distillation (RA‑CTD), a novel framework that directly injects a reward objective into the consistency‑trajectory distillation process. The architecture consists of three independently trained components: (1) a pre‑trained diffusion planner (teacher) built with the Elucidated Diffusion Model (EDM) and trained using a pseudo‑Huber loss; (2) a frozen, differentiable return‑to‑go reward model $R_\psi$ that predicts the discounted cumulative return given a state‑action pair; and (3) a student consistency model $G_\theta$ that learns to jump from any noisy timestep $t$ directly to the clean action sequence at time $0$.
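The return‑to‑go reward model $R_\psi$ is trained on targets derived from the offline trajectories. As a minimal illustration (not the authors' code), the discounted return‑to‑go for each timestep can be computed with a single backward pass over the reward sequence:

```python
import numpy as np

def return_to_go(rewards, gamma=0.99):
    """Discounted return-to-go target for each timestep:
    G_t = r_t + gamma * G_{t+1}, computed backward over the trajectory."""
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: rewards [1, 0, 2] with gamma = 0.5
# G_2 = 2, G_1 = 0 + 0.5*2 = 1, G_0 = 1 + 0.5*1 = 1.5
targets = return_to_go(np.array([1.0, 0.0, 2.0]), gamma=0.5)
```

A regression model fit to these targets over state–action pairs then serves as the frozen, differentiable $R_\psi$ during distillation.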
The training loss combines three terms:
- CTM loss ($L_{CTM}$) enforces "anytime‑to‑anytime" consistency by aligning the student's direct prediction from time $t$ to time $k$ with a two‑stage path that first applies the teacher's numerical solver from $t$ to an intermediate time $u$ and then the student from $u$ to $k$.
- DSM loss ($L_{DSM}$) is the standard denoising score‑matching loss that keeps the student's output close to the clean data distribution.
- Reward loss ($L_{Reward} = -R_\psi(s_n, \hat a_n)$) encourages the first action $\hat a_n$ of the student's predicted trajectory to maximize the estimated return.
The total objective is $L = \alpha L_{CTM} + \beta L_{DSM} + \sigma L_{Reward}$, with scalar weights $\alpha, \beta, \sigma$ tuned empirically. Because the student operates in the noise‑free (clean) space, the reward model does not need to be noise‑aware, avoiding the complications of classifier‑guided diffusion. Moreover, the three modules are trained in a decoupled fashion: the teacher diffusion model is frozen after pre‑training, the reward model is trained separately on the offline dataset, and only the student is updated during distillation. This eliminates the need for simultaneous multi‑network optimization, dramatically simplifying the hyper‑parameter search and improving training stability.
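The structure of the combined objective can be sketched as follows. This is a schematic sketch with hypothetical function names and squared-error stand-ins for the individual terms; it is not the paper's implementation, only an illustration of how the three losses are weighted and summed:

```python
import numpy as np

def ctm_loss(student_jump, target_jump):
    # "Anytime-to-anytime" consistency: match the student's direct
    # t -> k prediction against the teacher-solver-then-student path.
    return float(np.mean((student_jump - target_jump) ** 2))

def dsm_loss(denoised, clean):
    # Standard denoising score-matching term toward the clean data.
    return float(np.mean((denoised - clean) ** 2))

def reward_loss(predicted_return):
    # Maximizing the frozen reward model's estimate of the return
    # is expressed as minimizing its negative.
    return -float(predicted_return)

def total_loss(student_jump, target_jump, denoised, clean,
               predicted_return, alpha=1.0, beta=1.0, sigma=0.1):
    # L = alpha * L_CTM + beta * L_DSM + sigma * L_Reward
    return (alpha * ctm_loss(student_jump, target_jump)
            + beta * dsm_loss(denoised, clean)
            + sigma * reward_loss(predicted_return))
```

Only the student's parameters receive gradients from this objective; the teacher and the reward model stay frozen, which is what keeps the distillation stage a single-network optimization.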
Empirical evaluation spans three benchmark families: (i) D4RL MuJoCo continuous control tasks, (ii) FrankaKitchen robotic manipulation, and (iii) Maze2D long‑horizon planning. Across all environments, RA‑CTD achieves an average 9.7% improvement in normalized return over the previous state‑of‑the‑art diffusion/consistency planners and actor‑critic baselines. In terms of inference efficiency, the single‑step student yields up to a 142× speed‑up compared to the original multi‑step diffusion sampler, making it viable for latency‑sensitive applications. Notably, on "medium‑replay" datasets where the offline buffer contains many sub‑optimal trajectories, the reward‑aware loss successfully biases the student toward high‑reward modes, confirming the effectiveness of the proposed mode‑selection mechanism.
The paper’s contributions are threefold:
- Methodological innovation – integrating a differentiable reward objective into consistency trajectory distillation, thereby steering the distilled model toward high‑reward behavior rather than merely mimicking the teacher’s full multimodal distribution.
- Training decoupling – showing that a frozen diffusion teacher and an independently trained reward model suffice, removing the need for concurrent actor‑critic training and simplifying the overall pipeline.
- Empirical validation – demonstrating that a single‑step consistency model can match or exceed the performance of multi‑step diffusion planners while delivering orders‑of‑magnitude faster inference, even on heterogeneous, sub‑optimal offline data.
Future directions suggested include incorporating uncertainty estimates from the reward model (e.g., Bayesian or ensemble approaches), extending the framework to multi‑agent settings, and applying the same distillation‑plus‑reward paradigm to planners that operate on high‑dimensional observations such as images or point clouds. Overall, RA‑CTD bridges the gap between the expressive power of diffusion models and the real‑time demands of offline RL, offering a practical pathway toward high‑performance, low‑latency decision‑making systems.