Right-Side-Out: Learning Zero-Shot Sim-to-Real Garment Reversal
Turning garments right-side out is a challenging manipulation task: it is highly dynamic, entails rapid contact changes, and is subject to severe visual occlusion. We introduce Right-Side-Out, a zero-shot sim-to-real framework that effectively solves this challenge by exploiting task structure. We decompose the task into Drag/Fling to create and stabilize an access opening, followed by Insert&Pull to invert the garment. Each step uses a depth-inferred, keypoint-parameterized bimanual primitive that sharply reduces the action space while preserving robustness. Efficient data generation is enabled by our custom-built, high-fidelity, GPU-parallel Material Point Method (MPM) simulator that models thin-shell deformation and provides robust and efficient contact handling for batched rollouts. Built on the simulator, our fully automated pipeline scales data generation by randomizing garment geometry, material parameters, and viewpoints, producing depth, masks, and per-primitive keypoint labels without any human annotations. With a single depth camera, policies trained entirely in simulation deploy zero-shot on real hardware, achieving up to an 81.3% success rate. By combining task decomposition with high-fidelity simulation, our framework makes it possible to tackle highly dynamic, severely occluded tasks without laborious human demonstrations.
💡 Research Summary
The paper tackles the previously under‑explored problem of flipping a garment from its inside‑out state to a right‑side‑out configuration, a task that is highly dynamic, involves rapid contact changes, and suffers from severe visual occlusion. The authors introduce “Right‑Side‑Out”, a zero‑shot sim‑to‑real framework that solves this problem by exploiting three key ideas: (1) task decomposition, (2) keypoint‑parameterized bimanual primitives, and (3) a high‑fidelity GPU‑parallel Material Point Method (MPM) simulator for massive data generation.
Task decomposition splits the overall manipulation into two intuitive stages. The first stage, Drag‑Fling, creates an access opening by dragging a single‑layer patch from the garment’s collar toward the hem and then applying a brief fling to separate the front and back layers. The second stage, Insert&Pull, uses the second arm to insert its gripper into the opening, grasp the interior layer, and pull it through, thereby completing the flip. By breaking the problem into these sub‑goals, the authors reduce the horizon and simplify the contact topology that the policy must handle.
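The two-stage decomposition can be sketched as a simple sequential controller. The stage names (Drag, Fling, Insert&Pull) follow the paper; the function signatures, retry logic, and `max_attempts` parameter below are illustrative assumptions, not the authors' implementation:

```python
# Illustrative sketch of the decomposed Right-Side-Out pipeline.
# Control flow and signatures are assumptions made for clarity.

def run_right_side_out(observe, drag, fling, insert_and_pull, max_attempts=3):
    """Execute the two-stage task: create an access opening, then invert.

    observe()            -> returns a (depth, mask) observation
    drag/fling(...)      -> stage-1 primitives that expose an opening
    insert_and_pull(...) -> stage-2 primitive that inverts the garment
    """
    log = []
    for _attempt in range(max_attempts):
        obs = observe()
        drag(obs)                    # drag a single-layer patch collar -> hem
        log.append("drag")
        obs = observe()
        if fling(obs):               # brief fling separates front/back layers
            log.append("fling")
            obs = observe()
            insert_and_pull(obs)     # second arm enters opening, pulls through
            log.append("insert_pull")
            return log
    return log
```

Separating "make an opening" from "invert through the opening" is what shortens the horizon: each primitive only has to succeed locally, and a failed fling can simply be retried.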
Each sub‑goal is realized as a bimanual primitive whose parameters are a small set of 2‑D image‑plane keypoints extracted from a single overhead depth image and a binary garment mask. The depth image and mask are fed into a lightweight U‑Net that predicts dense value maps; the pixel with the highest value in each map is selected as the execution keypoint. The selected keypoints are back‑projected into world coordinates and then used to drive a fixed motion template (e.g., a predefined drag distance, fling trajectory, or pull vector). This design dramatically shrinks the action space from continuous 6‑DoF trajectories to a handful of 2‑D keypoint coordinates, making learning tractable and the resulting policy interpretable.
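The keypoint selection and back-projection step can be sketched in a few lines. The argmax-over-value-map rule is as described above; the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) are generic camera parameters, not values from the paper:

```python
import numpy as np

def select_keypoint(value_map):
    """Return (row, col) of the maximum of a dense value map."""
    idx = np.argmax(value_map)
    return np.unravel_index(idx, value_map.shape)

def backproject(pixel, depth_map, fx, fy, cx, cy):
    """Lift an image-plane keypoint to camera-frame 3-D coordinates
    using a standard pinhole model (illustrative intrinsics)."""
    v, u = pixel                      # row, col
    z = depth_map[v, u]               # metric depth at the keypoint
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```

A fixed motion template then consumes the resulting 3-D points, so the learned component only ever outputs image-plane coordinates.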
To provide the massive, diverse training data required for such a policy, the authors build a custom MPM simulator that models thin‑shell cloth as codimensional particles coupled to an Eulerian grid via APIC transfers. The simulator incorporates an anisotropic elastoplastic constitutive model for cloth, a continuous Coulomb friction formulation for self‑collision, and a one‑way projected friction update for robot‑cloth contact. Because contact handling is performed directly on the grid, the method avoids costly pairwise collision queries and scales efficiently on GPUs. The simulator runs thousands of parallel environments, allowing the authors to randomize garment geometry, material parameters, initial poses, and camera viewpoints. For each rollout, depth images, masks, and ground‑truth keypoint labels for each primitive are generated automatically, eliminating any need for human annotation or tele‑operation data collection.
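The APIC particle-to-grid transfer at the heart of such a simulator can be illustrated with a minimal 2-D sketch using quadratic B-spline weights. This is a generic MLS-MPM-style transfer, not the paper's codimensional thin-shell model, and it omits boundary handling, stresses, and friction:

```python
import numpy as np

def p2g_apic(xp, vp, Cp, mp, grid_n, dx):
    """APIC particle-to-grid transfer of mass and momentum (2-D sketch).

    xp: (P,2) positions, vp: (P,2) velocities, Cp: (P,2,2) affine
    velocity matrices, mp: (P,) masses. Quadratic B-spline weights
    over a 3x3 node stencil around each particle (interior only).
    """
    grid_m = np.zeros((grid_n, grid_n))
    grid_mv = np.zeros((grid_n, grid_n, 2))
    for p in range(len(xp)):
        base = np.floor(xp[p] / dx - 0.5).astype(int)   # stencil corner
        fx = xp[p] / dx - base                          # fractional offset
        # 1-D quadratic B-spline weights for node offsets 0, 1, 2
        w = [0.5 * (1.5 - fx) ** 2,
             0.75 - (fx - 1.0) ** 2,
             0.5 * (fx - 0.5) ** 2]
        for i in range(3):
            for j in range(3):
                node = base + np.array([i, j])
                weight = w[i][0] * w[j][1]
                dpos = node * dx - xp[p]                # node minus particle
                v_affine = vp[p] + Cp[p] @ dpos         # APIC velocity at node
                grid_m[node[0], node[1]] += weight * mp[p]
                grid_mv[node[0], node[1]] += weight * mp[p] * v_affine
    return grid_m, grid_mv
```

Because every particle scatters to a local node stencil, contact and friction can be resolved once per grid node rather than per particle pair, which is what makes batched GPU rollouts cheap.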
Training proceeds by feeding the synthetic depth‑mask pairs to three separate U‑Nets (one per primitive) and optimizing the value‑map predictions against the automatically generated keypoint labels. Domain randomization of visual and physical parameters ensures that the learned networks are robust to the sim‑real gap. At deployment time, the same networks process real depth images captured by a single RealSense D415 camera. The predicted keypoints are transformed to robot coordinates, and the corresponding primitives are executed on a dual‑arm 6‑DoF robot equipped with parallel grippers.
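Value-map supervision of this kind is commonly implemented by regressing a Gaussian heatmap centered on the labeled keypoint. The sketch below shows that idea in numpy; the Gaussian target, `sigma`, and mean-squared-error loss are standard choices assumed for illustration, not the paper's exact objective:

```python
import numpy as np

def gaussian_heatmap(shape, keypoint, sigma=2.0):
    """Dense target map: Gaussian bump centered on the labeled keypoint."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    ky, kx = keypoint
    return np.exp(-((ys - ky) ** 2 + (xs - kx) ** 2) / (2 * sigma ** 2))

def value_map_loss(pred, keypoint, sigma=2.0):
    """Mean-squared error between a predicted value map and the target."""
    target = gaussian_heatmap(pred.shape, keypoint, sigma)
    return np.mean((pred - target) ** 2)
```

At test time the same argmax rule used in training recovers the keypoint, so the network never has to output coordinates directly.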
Experimental evaluation on a real robot platform demonstrates a success rate of up to 81.3% across a variety of sleeveless tops differing in size, shape, and material. Success is defined as achieving at least 80% right‑side‑out coverage in the final top‑down view, measured by a segmentation model (SAM) and a simple face‑polarity classifier. The approach requires no real‑world demonstrations, fine‑tuning, or additional sensors beyond the depth camera, highlighting the effectiveness of the simulation‑only training pipeline.
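The success criterion reduces to a ratio of two binary masks, one for the garment and one for pixels whose visible face is classified right-side-out. The 80% threshold follows the paper's description; the function itself is an illustrative sketch:

```python
import numpy as np

def right_side_out_coverage(garment_mask, rso_mask):
    """Fraction of garment pixels whose visible face is right-side-out."""
    garment = garment_mask.astype(bool)
    rso = rso_mask.astype(bool) & garment
    if garment.sum() == 0:
        return 0.0
    return rso.sum() / garment.sum()

def is_success(garment_mask, rso_mask, threshold=0.8):
    """Success: at least 80% right-side-out coverage in the top-down view."""
    return right_side_out_coverage(garment_mask, rso_mask) >= threshold
```

In practice the garment mask would come from the segmentation model and the face-polarity labels from the classifier mentioned above.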
The contributions of the work are fourfold: (1) definition of the novel Right‑Side‑Out task, (2) introduction of compact, keypoint‑conditioned bimanual primitives that reduce the action space while preserving expressive power, (3) development of a high‑fidelity, GPU‑parallel MPM simulator with an automated data‑generation pipeline, and (4) demonstration of zero‑shot sim‑to‑real transfer for a highly dynamic, contact‑rich manipulation problem. The paper shows that careful task decomposition combined with physically accurate simulation can enable learning of complex cloth manipulation skills without labor‑intensive data collection, opening avenues for similar approaches in other dynamic, occlusion‑heavy robotic tasks.