HOGraspFlow: Taxonomy-Aware Hand-Object Retargeting for Multi-Modal SE(3) Grasp Generation

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

We propose Hand-Object (HO) GraspFlow, an affordance-centric approach that retargets a single RGB image with hand-object interaction (HOI) into multi-modal, executable parallel-jaw grasps without explicit geometric priors on target objects. Building on foundation models for hand reconstruction and vision, we synthesize SE(3) grasp poses with denoising flow matching (FM), conditioned on three complementary cues: RGB foundation features as visual semantics, HOI contact reconstruction, and a taxonomy-aware prior on grasp types. Our approach demonstrates high fidelity in grasp synthesis without explicit HOI contact input or object geometry, while maintaining strong contact and taxonomy recognition. A controlled comparison shows that HOGraspFlow consistently outperforms diffusion-based variants (HOGraspDiff), achieving higher distributional fidelity and more stable optimization in SE(3). We demonstrate reliable, object-agnostic grasp synthesis from human demonstrations in real-world experiments, with an average success rate of over 83%. Code: https://github.com/YitianShi/HOGraspFlow


💡 Research Summary

HOGraspFlow tackles the problem of retargeting human hand‑object interactions (HOI) to parallel‑jaw (PJ) robot grippers without relying on explicit 3D object models. The authors build a vision‑centric pipeline that extracts three complementary cues from a single RGB crop of a hand‑object scene: (1) visual semantics from a foundation vision model (DINOv2), (2) a contact map predicted from a monocular hand reconstructor (WiLoR) that provides MANO pose and shape parameters, and (3) a grasp‑type prior derived from a 33‑class grasp taxonomy, encoded in a learnable codebook. These cues are fused via a self‑attention module into a compact HOI‑aware descriptor, which serves as the conditioning input for a generative model operating directly on the SE(3) manifold.
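The cue-fusion step can be sketched as follows. This is a minimal single-head self-attention over three cue tokens in NumPy; the embedding dimension, pooling, and random placeholder features are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

D = 64  # hypothetical shared embedding dimension
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over the cue tokens."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V

# Three cue tokens: visual semantics (DINOv2 features), contact-map embedding
# (from the WiLoR/MANO reconstruction), and the taxonomy codebook entry.
# All are random placeholders here.
visual   = rng.normal(size=(1, D))
contact  = rng.normal(size=(1, D))
taxonomy = rng.normal(size=(1, D))

tokens = np.concatenate([visual, contact, taxonomy], axis=0)   # (3, D)
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

# Pool the attended tokens into one compact HOI-aware descriptor.
fused = self_attention(tokens, Wq, Wk, Wv).mean(axis=0)        # (D,)
print(fused.shape)  # (64,)
```

In the actual model the fused descriptor conditions the SE(3) generator; here mean-pooling stands in for whatever readout the authors use.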

Two generative frameworks are explored: HOGraspDiff, a score‑matching diffusion model adapted to SE(3), and HOGraspFlow, a flow‑matching (deterministic ODE) model that learns a left‑trivialized velocity field on se(3). Both models generate diverse 6‑DoF PJ grasp poses conditioned on the descriptor. Crucially, during sampling the contact map and taxonomy prior are used as differentiable guidance signals, steering the generation toward physically feasible, force‑closure grasps that respect the inferred human intent.
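The flow-matching sampler can be illustrated with a simple Euler integration on SE(3): each step evaluates a left-trivialized velocity in se(3) and applies it through the group exponential map, so the pose remains a valid rigid transform throughout. The velocity field below is a hand-written placeholder standing in for the learned DiT network, and the conditioning vector is hypothetical.

```python
import numpy as np

def hat_so3(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def exp_se3(xi):
    """Exponential map from se(3) (xi = [omega, v]) to a 4x4 SE(3) matrix."""
    w, v = xi[:3], xi[3:]
    th = np.linalg.norm(w)
    W = hat_so3(w)
    if th < 1e-8:
        R, V = np.eye(3) + W, np.eye(3)
    else:
        A = np.sin(th) / th
        B = (1 - np.cos(th)) / th**2
        C = (th - np.sin(th)) / th**3
        R = np.eye(3) + A * W + B * (W @ W)
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

def velocity_field(T, t, cond):
    """Placeholder for the learned left-trivialized velocity network
    (illustrative only; not the paper's model)."""
    return cond * (1.0 - t)

def sample_grasp(cond, steps=20):
    """Euler integration of the flow ODE on SE(3): T <- T @ exp(dt * v)."""
    T = np.eye(4)  # in practice, sampled from a noise distribution on SE(3)
    for k in range(steps):
        v = velocity_field(T, k / steps, cond)
        T = T @ exp_se3(v / steps)
    return T

cond = np.array([0.1, -0.2, 0.05, 0.3, 0.0, -0.1])  # hypothetical conditioning
T = sample_grasp(cond)
# The integrated pose stays a rigid transform: R is orthonormal.
print(np.allclose(T[:3, :3] @ T[:3, :3].T, np.eye(3), atol=1e-6))
```

Because updates are applied via `exp_se3`, no re-projection onto the manifold is needed, which is one reason flow matching on Lie groups tends to be numerically well-behaved.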

Training involves supervised contact prediction (weighted binary cross‑entropy) and grasp‑type classification (cross‑entropy), while the codebook provides a soft prior that mitigates classification errors. The generation network follows the Diffusion Transformer (DiT) architecture, with custom layers for Lie‑group operations. At inference, depth information is employed only for a Z‑only ICP refinement of the hand wrist frame, avoiding any need for full object meshes or pose estimation.
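The two supervised losses can be sketched directly. The `pos_weight` value, the toy contact map, and the confident-logit example below are illustrative assumptions; only the loss forms (weighted BCE for contact, cross-entropy over 33 grasp types) come from the summary above.

```python
import numpy as np

def weighted_bce(pred, target, pos_weight=5.0, eps=1e-7):
    """Weighted binary cross-entropy for contact prediction;
    pos_weight (hypothetical value) upweights the sparse contact vertices."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(pos_weight * target * np.log(pred)
                    + (1 - target) * np.log(1 - pred))

def cross_entropy(logits, label):
    """Cross-entropy for 33-way grasp-type classification."""
    z = logits - logits.max()            # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# Toy per-vertex contact probabilities vs. ground truth (placeholder values).
contact_pred   = np.array([0.9, 0.2, 0.7, 0.1])
contact_target = np.array([1.0, 0.0, 1.0, 0.0])

# Toy taxonomy logits: confident in (arbitrary) class 4 of 33.
tax_logits = np.zeros(33)
tax_logits[4] = 3.0

loss = weighted_bce(contact_pred, contact_target) + cross_entropy(tax_logits, 4)
print(loss > 0)  # True
```

A correct, confident prediction yields a small cross-entropy term, while the BCE weighting keeps the (mostly non-contact) vertices from drowning out the contact signal.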

Extensive experiments demonstrate that HOGraspFlow outperforms the diffusion baseline in distributional fidelity (lower KL divergence), contact accuracy (higher IoU), and grasp‑type prediction (≈92% accuracy). Real‑world robot trials on a UR5e equipped with a parallel‑jaw gripper achieve an average success rate above 83% across a variety of everyday objects, despite the absence of explicit object geometry. The flow‑matching approach yields more stable optimization and faster convergence compared to score‑matching diffusion, highlighting its suitability for SE(3) generation tasks.

The paper contributes (i) a taxonomy‑aware, contact‑driven HOI embedding that captures human grasp intent without object models, (ii) a deterministic flow‑matching generative framework on SE(3) that preserves multi‑modality and enables in‑loop guidance, and (iii) a practical demonstration that object‑agnostic grasp synthesis from single‑view human demonstrations can be reliably deployed on real robots. Limitations include reliance on accurate hand detection and potential translation errors for extremely shallow objects; future work may extend the method to multi‑fingered grippers and more complex manipulation primitives. Overall, HOGraspFlow represents a significant step toward scalable, vision‑only robot grasp synthesis grounded in human intent.

