Iterative On-Policy Refinement of Hierarchical Diffusion Policies for Language-Conditioned Manipulation

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv source.

Hierarchical policies for language-conditioned manipulation decompose tasks into subgoals: a high-level planner proposes subgoals that a low-level controller executes. However, these hierarchical agents often fail because the planner generates subgoals without accounting for the controller's actual limitations. Existing solutions attempt to bridge this gap via intermediate modules or shared representations, but they remain limited by their reliance on fixed offline datasets. We propose HD-ExpIt, a framework for iteratively fine-tuning hierarchical diffusion policies with environment feedback. HD-ExpIt organizes training into a self-reinforcing cycle: it uses diffusion-based planning to autonomously discover successful behaviors, which are then distilled back into the hierarchical policy. This loop improves both components while implicitly grounding the planner in the controller's actual capabilities, without requiring explicit proxy models. Empirically, HD-ExpIt substantially improves hierarchical policies trained solely on offline data, achieving state-of-the-art performance on the long-horizon CALVIN benchmark among methods trained from scratch.


💡 Research Summary

The paper tackles a fundamental problem in language‑conditioned robotic manipulation: the mismatch between a high‑level (HL) planner that proposes subgoals and a low‑level (LL) controller that must actually achieve those subgoals. Existing solutions try to bridge this gap with intermediate “glue” modules or shared latent spaces, but they rely on fixed offline datasets, require extra proxy models, and often suffer from instability or inference overhead.

HD‑ExpIt (Hierarchical Diffusion with Expert Iteration) introduces a simple yet powerful training loop that iteratively refines both components using only environment feedback. The HL planner is a diffusion model that, conditioned on the language instruction and the initial visual observation, generates a full sequence of visual subgoals in one shot. The LL controller is a goal‑conditioned policy that, given the current observation and a target subgoal observation, outputs a short chunk of robot actions.
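To make the division of labor concrete, here is a minimal sketch of the two interfaces, with the diffusion sampler and policy network stubbed out as random draws; the class and method names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

class DiffusionPlanner:
    """HL planner sketch: (instruction, initial observation) -> a full
    sequence of visual subgoals in one shot. The actual diffusion model
    is stubbed here with random images."""
    def __init__(self, n_subgoals=4, obs_shape=(8, 8, 3)):
        self.n_subgoals = n_subgoals
        self.obs_shape = obs_shape

    def sample_plan(self, instruction, obs0):
        # Each call is a fresh stochastic draw, so repeated sampling
        # yields different candidate subgoal sequences.
        return [rng.normal(size=self.obs_shape) for _ in range(self.n_subgoals)]

class GoalConditionedController:
    """LL controller sketch: (current observation, target subgoal) -> a
    short chunk of robot actions (here a random fixed-length chunk)."""
    def __init__(self, chunk_len=8, action_dim=7):
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def act(self, obs, subgoal):
        return rng.normal(size=(self.chunk_len, self.action_dim))

planner = DiffusionPlanner()
controller = GoalConditionedController()
plan = planner.sample_plan("stack the red block", rng.normal(size=(8, 8, 3)))
chunk = controller.act(rng.normal(size=(8, 8, 3)), plan[0])
print(len(plan), chunk.shape)  # 4 subgoals, one (8, 7) action chunk
```

The key structural point is that the two components communicate only through subgoal observations, which is what makes the feasibility mismatch described above possible.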

Training proceeds in three stages per iteration:

  1. Supervised Update – Both HL and LL are trained on the current dataset Dₜ, which initially consists of a static offline collection of successful trajectories.
  2. On‑policy Data Collection – The current hierarchical policy is deployed in the environment. The stochastic nature of the diffusion planner is exploited as an implicit generative search: multiple samples of subgoal sequences are drawn for each task, and the LL attempts to execute each transition. Binary success/failure feedback from the environment filters out infeasible plans; only fully successful rollouts are kept as new trajectories Rₜ.
  3. Dataset Aggregation – Rₜ is merged with Dₜ to form Dₜ₊₁, which becomes the training set for the next iteration.
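The three stages above can be sketched as a toy loop. Training, rollouts, and the environment are stubbed (success is a coin flip whose odds grow with dataset size, mimicking the improving policy); all function names are illustrative assumptions:

```python
import random

random.seed(0)

def supervised_update(policy, dataset):
    # Stage 1: fit HL and LL on D_t (stub: just record the dataset size).
    policy["trained_on"] = len(dataset)
    return policy

def collect_on_policy(policy, n_rollouts=20):
    # Stage 2: deploy the policy, keep only fully successful rollouts R_t.
    # Success probability is a stub that rises with training-set size.
    p_success = min(0.9, 0.3 + 0.01 * policy["trained_on"])
    return [f"traj_{i}" for i in range(n_rollouts)
            if random.random() < p_success]

def expert_iteration(offline_data, n_iters=3):
    dataset = list(offline_data)          # D_0: static offline trajectories
    policy = {"trained_on": 0}
    for t in range(n_iters):
        policy = supervised_update(policy, dataset)   # 1. supervised update
        successes = collect_on_policy(policy)         # 2. on-policy collection
        dataset.extend(successes)                     # 3. D_{t+1} = D_t ∪ R_t
    return policy, dataset

policy, final_data = expert_iteration([f"offline_{i}" for i in range(10)])
print(len(final_data))
```

Because stage 3 only ever appends verified successes, each iteration's training set is at least as good as the last, which is what keeps the plain supervised updates stable.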

Because the diffusion planner’s randomness naturally explores a wide variety of plans, the loop discovers behaviors that were absent from the original offline data but are nonetheless realizable by the current LL. Over successive iterations the planner’s distribution shifts toward subgoals that the controller can reliably achieve, while the controller benefits from richer, more realistic training examples. No explicit model of LL capabilities or shared representation is required, and the entire refinement relies on straightforward supervised learning, avoiding the high variance of policy‑gradient RL.
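The "implicit generative search" amounts to rejection sampling: draw many plans from the stochastic planner and keep only those the controller executes end to end. A toy sketch, with plan sampling and execution stubbed (a rollout succeeds only if every subgoal transition succeeds; all names are illustrative):

```python
import random

random.seed(1)

def sample_plan(n_subgoals=4):
    # Stub for a stochastic planner draw: each subgoal gets a
    # "difficulty" in [0, 1).
    return [random.random() for _ in range(n_subgoals)]

def execute(plan, skill=0.7):
    # Stub for the LL rollout: the whole plan succeeds only if every
    # subgoal transition is within the controller's skill level.
    return all(d < skill for d in plan)

def filtered_rollouts(n_samples=100):
    # Rejection sampling: keep only fully successful plans.
    accepted = [p for p in (sample_plan() for _ in range(n_samples))
                if execute(p)]
    return accepted, len(accepted) / n_samples

accepted, rate = filtered_rollouts()
print(rate)
```

In this stub the acceptance rate is roughly skill^n_subgoals; in HD-ExpIt the analogous rate rises over iterations because the planner's distribution shifts toward feasible subgoals, which matches the acceptance-rate ablation reported below.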

Empirical evaluation is conducted on two benchmarks: a simulated Franka‑3Blocks environment and the long‑horizon CALVIN suite. On CALVIN, which demands execution of five consecutive language‑specified tasks, HD‑ExpIt more than doubles the success rate of a baseline trained only on the static offline dataset. Moreover, it outperforms all prior hierarchical methods trained from scratch, establishing a new state of the art. Ablation studies show that after 5–6 iterations, over 70% of the sampled subgoal transitions are accepted, indicating a rapid reduction of the HL–LL coupling mismatch.

The paper’s contributions are threefold: (1) a self‑reinforcing training loop that treats the diffusion planner as a stochastic search expert, (2) an implicit alignment mechanism that grounds high‑level planning in the actual low‑level capabilities via binary environment feedback, and (3) a thorough empirical demonstration that this approach yields stable learning, low computational overhead, and superior performance on challenging multi‑task, long‑horizon manipulation problems.

In summary, HD‑ExpIt offers a principled, data‑efficient pathway to close the gap between planning and execution in language‑conditioned robotic systems, paving the way for more reliable and scalable autonomous manipulation in real‑world settings.

