Constrained Sampling to Guide Universal Manipulation RL
We consider how model-based solvers can be leveraged to guide the training of a universal policy that controls from any feasible start state to any feasible goal in a contact-rich manipulation setting. While Reinforcement Learning (RL) has demonstrated its strength in such settings, it may struggle to explore sufficiently and discover complex manipulation strategies, especially under sparse rewards. Our approach is based on the idea that the feasible states likely to be visited during such manipulation lie on a lower-dimensional manifold, and that RL can be guided by a sampler from this manifold. We propose Sample-Guided RL, which uses model-based constraint solvers to efficiently sample feasible configurations (satisfying differentiable collision, contact, and force constraints) and leverages them to guide RL for universal (goal-conditioned) manipulation policies. We study using these samples directly to bias state visitation, as well as black-box optimization of open-loop trajectories between random configurations to impose a stronger state bias and, optionally, to add a behavior cloning loss. In a minimalistic double-sphere manipulation setting, Sample-Guided RL discovers complex manipulation strategies and achieves high success rates in reaching any statically stable state. In a more challenging Panda-arm setting, our approach achieves a substantial success rate where the baseline remains near zero, and demonstrates a breadth of complex whole-body-contact manipulation strategies.
💡 Research Summary
The paper addresses a fundamental challenge in contact‑rich robotic manipulation: learning a universal, goal‑conditioned policy that can move from any feasible start state to any feasible goal state. While model‑based planners exploit geometric and physical constraints to guide exploration, modern reinforcement learning (RL) treats the dynamics as a black box and often struggles with sparse rewards and inefficient exploration. To bridge this gap, the authors introduce the Constrained Goal‑conditioned MDP (CG‑MDP), an extension of the standard goal‑conditioned MDP that explicitly incorporates differentiable collision, contact, and force constraints. The feasible state set S_c is defined by inequality constraints g_c(s) ≤ 0 (e.g., non‑penetration) and equality constraints h_c(s) = 0 (e.g., static equilibrium), together with discrete contact‑mode variables c_ij. The start‑goal distribution p_0(s,g) is taken as uniform over S_c × S_c, which requires a dedicated sampler because sampling feasible configurations is a non‑convex problem.
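To make the feasible set S_c concrete, the sketch below encodes the two constraint types on a toy planar double-sphere world. The radii, the choice of constraints, and the "resting on the floor" contact mode are illustrative assumptions for this example, not the paper's actual constraint set:

```python
import numpy as np

R_ROBOT, R_OBJ = 0.1, 0.1  # hypothetical sphere radii for this toy example

def g_ineq(s):
    """Inequality constraints g_c(s) <= 0 (non-penetration).
    s = [x_r, z_r, x_o, z_o]: planar positions of robot and object spheres."""
    p_r, p_o = s[:2], s[2:]
    return np.array([
        (R_ROBOT + R_OBJ) - np.linalg.norm(p_r - p_o),  # spheres must not overlap
        R_ROBOT - p_r[1],                                # robot stays above the floor
        R_OBJ - p_o[1],                                  # object stays above the floor
    ])

def h_eq(s):
    """Equality constraint h_c(s) = 0 for one contact mode: the object rests
    on the floor, so its center sits exactly one radius above the ground."""
    return np.array([s[3] - R_OBJ])

def is_feasible(s, tol=1e-6):
    """Membership test for S_c under this toy constraint set."""
    return bool(np.all(g_ineq(s) <= tol) and np.all(np.abs(h_eq(s)) <= tol))
```

A state with the two spheres separated and the object resting on the floor passes the test, while an interpenetrating configuration violates the first inequality.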
The authors propose a two‑stage constrained sampling pipeline. First, contact modes are sampled randomly, limiting the number of active support contacts to three to keep the problem tractable. Second, for each sampled mode, a random point s̄ is drawn from a box prior and projected onto the feasible manifold using an Augmented Lagrangian solver. Repeating this process yields a dataset D_s of feasible states that captures the diversity of physically plausible configurations.
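The projection step can be sketched as a generic augmented-Lagrangian loop. This is a minimal illustration on a toy equality constraint (projecting a box-prior sample onto the unit circle), not the paper's solver; `scipy.optimize.minimize` stands in for the inner unconstrained solve:

```python
import numpy as np
from scipy.optimize import minimize

def project_to_manifold(s_bar, h, n_outer=15, mu=10.0):
    """Augmented-Lagrangian projection of s_bar onto {s : h(s) = 0}.
    Minimizes 0.5 * ||s - s_bar||^2 subject to the equality constraint by
    alternating an unconstrained inner solve with multiplier updates."""
    lam = np.zeros_like(h(s_bar))
    s = s_bar.copy()
    for _ in range(n_outer):
        def aug_lag(x):
            hx = h(x)
            return (0.5 * np.sum((x - s_bar) ** 2)
                    + lam @ hx
                    + 0.5 * mu * np.sum(hx ** 2))
        s = minimize(aug_lag, s, method="BFGS").x  # inner unconstrained solve
        lam = lam + mu * h(s)                      # first-order multiplier update
    return s
```

With h(s) = ||s||^2 - 1, a point drawn outside the circle is pulled onto it, landing at the nearest feasible state; in the paper the same idea is applied with the full collision/contact/force constraint set.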
These samples are used to bias RL in two complementary ways. (1) Episode initialization: at the beginning of each episode, a start‑goal pair (s,g) is drawn from D_s, ensuring that the agent always begins at, and aims for, physically realizable states. (2) Zero‑order open‑loop trajectory generation: given a random (s,g) pair, a low‑dimensional B‑spline control trajectory is optimized via black‑box methods (e.g., CMA‑ES) to minimize the Euclidean distance between the feature vectors ϕ(g) and ϕ(x(T)). The feature vector includes object position, robot generalized coordinates, their velocities, and a continuous contact indicator c, which provides an implicit gradient toward the desired contact mode. The resulting open‑loop trajectories can be used for a behavior cloning (BC) loss or simply as additional guidance for the policy.
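The zero-order trajectory step can be illustrated end to end on simplified stand-ins: double-integrator point-mass dynamics in place of the contact-rich simulator, linear interpolation of knots in place of a B-spline, and a plain cross-entropy method in place of CMA-ES. The feature vector here is just terminal position and velocity (no contact indicator), so this is a sketch of the optimization pattern, not the paper's setup:

```python
import numpy as np

def spline_controls(knots, T):
    """Linearly interpolate control knots over T steps (stand-in for a B-spline)."""
    t = np.linspace(0, len(knots) - 1, T)
    i = np.minimum(t.astype(int), len(knots) - 2)
    frac = (t - i)[:, None]
    return (1 - frac) * knots[i] + frac * knots[i + 1]

def rollout(knots, x0, T=50, dt=0.1):
    """Roll out double-integrator dynamics under the open-loop controls and
    return the terminal feature vector phi(x(T)) = [position, velocity]."""
    u = spline_controls(knots, T)
    p, v = x0[:2].copy(), x0[2:].copy()
    for ut in u:
        v = v + dt * ut
        p = p + dt * v
    return np.concatenate([p, v])

def optimize_trajectory(x0, g_feat, n_knots=4, iters=60, pop=64, elite=8, seed=0):
    """Zero-order search over spline knots minimizing ||phi(g) - phi(x(T))||.
    A simple cross-entropy method stands in for CMA-ES."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros((n_knots, 2)), np.ones((n_knots, 2))
    for _ in range(iters):
        cand = mean + std * rng.standard_normal((pop, n_knots, 2))
        costs = np.array([np.linalg.norm(g_feat - rollout(k, x0)) for k in cand])
        best = cand[np.argsort(costs)[:elite]]
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mean
```

Because only rollouts are queried, no analytic dynamics or gradients are needed; this is what lets the method handle discontinuous contact dynamics in the real pipeline.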
The learning algorithm builds on Soft Actor‑Critic (SAC) for goal‑conditioned RL, augmented with the above state‑biasing mechanisms, optional BC, reward relabeling, and shaping. Experiments are conducted in two environments. The first, a “double sphere” scenario with a spherical robot, a spherical object, and simple floor/wall constraints, demonstrates that constrained sampling alone enables the agent to discover sophisticated push‑and‑roll strategies and achieve near‑perfect success rates in reaching any statically stable configuration. The second, a more realistic Panda arm manipulation task where the arm must place a sphere onto various support surfaces, shows that Sample‑Guided RL (sampling + trajectory optimization) raises the success rate from near zero (baseline) to over 30 %, while also exhibiting a variety of whole‑body contact strategies.
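The two biasing mechanisms that plug into the SAC loop can be sketched in a few lines. The helper names are hypothetical, and the relabeling shown is a hindsight-style variant with a sparse threshold reward; the paper's exact relabeling and shaping scheme may differ:

```python
import numpy as np

def reset_from_samples(dataset, rng):
    """Episode-initialization bias: draw a feasible start-goal pair (s, g)
    from the constrained-sample dataset D_s."""
    s = dataset[rng.integers(len(dataset))]
    g = dataset[rng.integers(len(dataset))]
    return s, g

def relabel(transitions, achieved, tol=0.05):
    """Hindsight-style reward relabeling: treat the episode's final achieved
    state as the goal and recompute the sparse reward for each transition.
    transitions: list of (s, a, s_next); returns (s, a, g, r, s_next) tuples."""
    out = []
    for s, a, s_next in transitions:
        r = 1.0 if np.linalg.norm(s_next - achieved) < tol else 0.0
        out.append((s, a, achieved, r, s_next))
    return out
```

Relabeling turns otherwise reward-free episodes into useful goal-conditioned training data, which compounds with the feasible-state initialization to densify learning signal in the sparse-reward regime.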
Key contributions are: (i) formalizing CG‑MDP to embed first‑principles physical constraints into the RL problem; (ii) presenting an efficient constrained state sampler based on Augmented Lagrangian projection; (iii) introducing zero‑order spline‑based trajectory optimization to generate feasible open‑loop demonstrations without requiring analytic dynamics; and (iv) empirically validating that biasing state initialization with physically feasible samples dramatically improves sample efficiency and policy robustness in sparse‑reward, contact‑rich tasks. The work opens avenues for integrating constraint‑aware curricula, scaling to multi‑object manipulation, and transferring the approach to real‑world robots where fast, reliable constrained sampling will be essential.