Push Smarter, Not Harder: Hierarchical RL-Diffusion Policy for Efficient Nonprehensile Manipulation
Nonprehensile manipulation, such as pushing objects across cluttered environments, presents a challenging control problem due to complex contact dynamics and long-horizon planning requirements. In this work, we propose HeRD, a hierarchical reinforcement learning-diffusion policy that decomposes pushing tasks into two levels: high-level goal selection and low-level trajectory generation. We employ a high-level reinforcement learning (RL) agent to select intermediate spatial goals, and a low-level goal-conditioned diffusion model to generate feasible, efficient trajectories to reach them. This architecture combines the long-term reward-maximizing behaviour of RL with the generative capabilities of diffusion models. We evaluate our method in a 2D simulation environment and show that it outperforms the state-of-the-art baseline in success rate, path efficiency, and generalization across multiple environment configurations. Our results suggest that hierarchical control with generative low-level planning is a promising direction for scalable, goal-directed nonprehensile manipulation. Code, documentation, and trained models are available: https://github.com/carosteven/HeRD.
💡 Research Summary
This paper, “Push Smarter, Not Harder: Hierarchical RL-Diffusion Policy for Efficient Nonprehensile Manipulation,” introduces a novel hierarchical control framework named HeRD (Hierarchical Reinforcement Learning-Diffusion Policy) for solving nonprehensile manipulation tasks, specifically pushing multiple boxes to a receptacle in cluttered 2D environments.
Nonprehensile manipulation, such as pushing, is challenging due to complex contact dynamics and the need for long-horizon planning. Traditional Reinforcement Learning (RL) approaches that output low-level control commands force the agent to learn both task strategy and robot dynamics simultaneously. The Spatial Action Maps (SAM) method alleviates this by having a high-level RL policy output spatial goal coordinates, which a separate low-level controller then executes by following a route computed with a shortest-path algorithm. However, this low-level path is often geometrically simple and ignores environmental context, leading to potential inefficiencies or failures.
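The spatial-action-map idea can be illustrated with a minimal sketch: the Q-network scores every pixel of the map as a candidate goal, and the policy picks the argmax among valid pixels. The function and the toy Q-values below are illustrative assumptions, not the paper's actual network or data.

```python
import numpy as np

def select_spatial_goal(q_map: np.ndarray, obstacle_mask: np.ndarray):
    """Pick the pixel with the highest Q-value as the intermediate goal.

    q_map: (H, W) array of Q-values, one per candidate goal pixel.
    obstacle_mask: (H, W) boolean array; True marks pixels inside static
    obstacles, which are invalid goals.
    """
    masked = np.where(obstacle_mask, -np.inf, q_map)
    flat_idx = int(np.argmax(masked))
    return np.unravel_index(flat_idx, q_map.shape)  # (row, col) goal pixel

# Hypothetical 4x4 Q-map; the best raw Q-value sits on an obstacle pixel.
q = np.array([[0.1, 0.9, 0.2, 0.0],
              [0.3, 0.4, 0.8, 0.1],
              [0.0, 0.2, 0.1, 0.5],
              [0.6, 0.0, 0.3, 0.2]])
obstacles = np.zeros((4, 4), dtype=bool)
obstacles[0, 1] = True  # mask out the 0.9 entry
goal = select_spatial_goal(q, obstacles)  # falls back to the 0.8 pixel
```

In the paper's setting the Q-map comes from a DDQN over a multi-channel semantic map; here a hand-written array stands in for it.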
The core innovation of HeRD is its explicit two-level hierarchy that strategically combines the strengths of RL and generative models. The high-level policy is a Double Deep Q-Network (DDQN) with a SAM action space. It takes a multi-channel semantic map of the environment (including obstacles, boxes, the receptacle, and the robot) and selects an intermediate spatial goal pixel. The key departure from SAM lies in the low-level controller. HeRD employs a conditional strategy: if the shortest path from the robot to the selected goal intersects with any movable boxes (implying a pushing action is needed), it uses the original proportional controller. If the path is clear of boxes (implying only navigation or strategic positioning is required), it activates a learned goal-conditioned diffusion policy.
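The conditional dispatch described above can be sketched as a simple check of the planned segment against the box channel of the semantic map. This is a simplified stand-in (a straight-line sample rather than the actual shortest path) with hypothetical helper names; the paper's implementation may differ.

```python
import numpy as np

def path_hits_box(start, goal, box_mask, n_samples=64):
    """Sample points along the segment from start to goal and check whether
    any falls on a movable box. A stand-in for testing the planner's
    shortest path against the box channel of the semantic map."""
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    for t in np.linspace(0.0, 1.0, n_samples):
        r, c = np.round(start + t * (goal - start)).astype(int)
        if box_mask[r, c]:
            return True
    return False

def low_level_controller(start, goal, box_mask):
    """HeRD's conditional strategy: the proportional controller when the
    path touches a box (a push is intended), the learned goal-conditioned
    diffusion policy when the path is clear (navigation/positioning)."""
    return "proportional" if path_hits_box(start, goal, box_mask) else "diffusion"

boxes = np.zeros((8, 8), dtype=bool)
boxes[4, 4] = True  # one movable box in the middle of the map
push_case = low_level_controller((0, 0), (7, 7), boxes)  # diagonal crosses the box
nav_case = low_level_controller((0, 7), (0, 0), boxes)   # top row, clear of boxes
```

The design point is that the dispatch needs no learned component of its own: which controller runs is fully determined by geometry already present in the observation.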
This diffusion policy is a denoising diffusion probabilistic model trained on human demonstration data. It generates smooth, feasible, and context-aware trajectories by iteratively denoising a noise sequence into a robot action plan. The model is conditioned on the goal coordinates via “goal inpainting” (fixing the trajectory endpoints) and on the environmental observation via FiLM layers, ensuring the generated path respects geometry and avoids collisions. The rationale for this conditional design is insightful: RL can effectively learn from the strong, immediate reward signal provided when pushing boxes (progress reward). However, the subtle strategies for efficient navigation and setup—such as strategically avoiding boxes or positioning for a future push—are difficult to encode into a reward function but are naturally captured in human demonstrations. The diffusion policy leverages this human intuition.
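The reverse-diffusion sampling with goal inpainting can be sketched as follows. The denoising network is replaced by a toy update that pulls the trajectory toward a straight line (an assumption for illustration, not the trained model), but the inpainting step — clamping the endpoints back to the start and goal after every denoising iteration — and the FiLM-style feature modulation mirror the mechanisms the summary describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, gamma, beta):
    """FiLM conditioning: a feature-wise affine transform whose scale and
    shift parameters are predicted from the observation embedding."""
    return gamma * features + beta

def denoise_step(traj, cond):
    """Stand-in for the learned noise predictor: nudge the trajectory
    toward the straight line between its current endpoints (assumption,
    not the paper's trained DDPM)."""
    target = np.linspace(traj[0], traj[-1], len(traj))
    return traj + 0.5 * (target - traj)

def sample_trajectory(start, goal, cond=None, horizon=16, n_steps=10):
    """DDPM-style reverse process with goal inpainting: after every
    denoising step the endpoints are clamped to start and goal, so the
    generated plan is guaranteed to connect them."""
    traj = rng.normal(size=(horizon, 2))   # begin from pure noise
    for _ in reversed(range(n_steps)):
        traj = denoise_step(traj, cond)    # one reverse-diffusion step
        traj[0], traj[-1] = start, goal    # goal inpainting
    return traj

plan = sample_trajectory(np.array([0.0, 0.0]), np.array([5.0, 5.0]))
```

Because the clamp runs inside the loop rather than once at the end, intermediate waypoints are denoised in the context of the true endpoints, which is what makes inpainting an effective conditioning mechanism.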
The method was evaluated in a 2D simulation with various environment configurations containing static obstacles and multiple boxes. HeRD was compared against the state-of-the-art SAM baseline. Results demonstrated that HeRD significantly outperformed SAM in terms of success rate (the probability of placing all boxes in the receptacle). Furthermore, HeRD achieved this success with a lower total distance traveled by the robot, indicating that the diffusion policy generated smarter, more efficient navigation trajectories than the simple shortest-path approach. The diffusion policy’s ability to implicitly understand feasibility and produce human-like maneuvers led to more effective overall task execution.
In conclusion, HeRD presents a promising hybrid architecture for scalable robotic manipulation. It successfully marries the long-term reward maximization capability of hierarchical RL with the rich, context-aware trajectory generation of diffusion models learned from demonstrations. This work highlights the potential of using generative models as intelligent low-level planners within a hierarchical control framework to tackle complex, contact-rich tasks where designing perfect reward functions is challenging.