From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Two-hand reconstruction from monocular images is hampered by complex poses and severe occlusions, which often cause interaction misalignment and two-hand penetration. We address this by decoupling the problem into 2D structural alignment and 3D spatial interaction alignment, each handled by a tailored component. For 2D alignment, we make the first attempt to unify heterogeneous structural priors (keypoints, segmentation, and depth) from vision foundation models as complementary structured guidance for two-hand recovery. Instead of feeding explicit prior predictions as inputs, we propose a fusion-alignment encoder that absorbs their structural knowledge implicitly, achieving foundation-level guidance without foundation-level cost. For 3D spatial alignment, we propose a two-hand penetration-free diffusion model that learns a generative mapping from interpenetrated poses to realistic, collision-free configurations. Guided by collision gradients during denoising, the model converges toward the manifold of valid two-hand interactions, preserving geometric and kinematic coherence. This generative formulation enables physically credible reconstructions even under occlusion or ambiguous visual input. Extensive experiments on InterHand2.6M and HIC show state-of-the-art performance in interaction alignment and penetration suppression. Project: https://gaogehan.github.io/A2P/


💡 Research Summary

Two‑hand 3D reconstruction from a single RGB image remains challenging due to ambiguous 2D‑3D correspondence, severe occlusions, and hand‑hand interpenetration. This paper tackles these issues by explicitly separating the problem into two complementary alignment stages—2D structural alignment and 3D spatial interaction alignment—each addressed with a purpose‑built module.

2D Structural Alignment. The authors leverage a human‑centric vision foundation model (Sapiens) that can predict three heterogeneous cues: 2D hand keypoints, segmentation masks, and depth maps. Instead of feeding these predictions directly into the reconstruction network (which would be computationally expensive and ambiguous), they introduce a lightweight Fusion Alignment Encoder (FAE). During training, the FAE is distilled from the foundation model’s latent outputs using an MSE loss, learning to produce a fused prior feature (F_p) that implicitly encodes all three cues. At inference time the foundation model is completely removed; only the FAE remains, providing multi‑modal guidance without any extra runtime cost. The fused prior is concatenated with image features and processed by a transformer encoder, after which a MANO‑based regressor predicts hand pose, shape, and global translation. The overall loss combines standard L1 hand parameter regression with a prior alignment term that forces the encoder’s output to match the distilled fused prior.
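The FAE training objective described above can be sketched in a few lines. Everything in this snippet (average-based fusion of the three teacher latents, the loss weight `lam_prior`, and the tensor shapes) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two feature arrays."""
    return np.mean((a - b) ** 2)

def fae_distillation_loss(student_fused, teacher_kpt, teacher_seg, teacher_depth,
                          pred_mano, gt_mano, lam_prior=1.0):
    """Illustrative training objective for the Fusion Alignment Encoder (FAE).

    The foundation model's latents for keypoints, segmentation, and depth are
    fused into a single target prior F_p; the student FAE output is pulled
    toward it with MSE, alongside the standard L1 loss on MANO parameters.
    Fusion by simple averaging is an assumption made for this sketch.
    """
    target_fp = (teacher_kpt + teacher_seg + teacher_depth) / 3.0  # fused prior F_p
    prior_align = mse(student_fused, target_fp)                    # distillation term
    param_l1 = np.mean(np.abs(pred_mano - gt_mano))                # MANO parameter L1
    return param_l1 + lam_prior * prior_align
```

Because the distillation term depends only on the student's output and the cached teacher latents, the foundation model itself can be dropped at inference time, which is what keeps the runtime cost flat.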

3D Spatial Interaction Alignment. Even with accurate 2D guidance, reconstructed hands can still intersect when one hand occludes the other. To resolve this, the paper proposes a penetration‑free diffusion model that learns a generative mapping from interpenetrated hand configurations to physically plausible, collision‑free ones. Interpenetrated inputs are generated either by a low‑performing baseline estimator or by adding noise to ground‑truth MANO parameters until penetration occurs. The diffusion process follows a DDIM schedule; at each reverse step the model receives the noisy hand state and the penetrated condition, and outputs a denoised estimate. Crucially, the authors augment the denoising with Collision Gradient Guidance: they compute Chamfer distances between the two hand meshes, select vertex pairs within a distance threshold, evaluate the cosine similarity of their normals, and collect a collision set. A robust Geman‑McClure (GMoF) loss is then back‑propagated to push intersecting vertices apart. An IoU‑based pre‑check skips diffusion when hands are already well separated, saving computation. The diffusion loss is an L2 term between the clean target and the model output, combined with the collision loss.
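The collision machinery above can be sketched as follows. The distance and cosine thresholds, the nearest-neighbor pairing, and the box-based IoU pre-check are assumptions for illustration; only the GMoF form and the overall recipe (threshold, normal check, robust penalty) come from the summary:

```python
import numpy as np

def gmof(x, sigma=0.01):
    """Geman-McClure robust penalty (GMoF); saturates for large residuals."""
    x2 = x ** 2
    return (sigma ** 2 * x2) / (sigma ** 2 + x2)

def collision_loss(verts_l, normals_l, verts_r, normals_r,
                   dist_thresh=0.005, cos_thresh=0.0):
    """Illustrative collision term: each left-hand vertex is paired with its
    nearest right-hand vertex (Chamfer-style NN search); pairs closer than
    dist_thresh whose normals point toward each other (cosine below
    cos_thresh) form the collision set, penalized with the GMoF loss.
    Threshold values are assumptions for this sketch.
    """
    # Pairwise distances between the two vertex sets, shape (N_l, N_r).
    d = np.linalg.norm(verts_l[:, None, :] - verts_r[None, :, :], axis=-1)
    nn = d.argmin(axis=1)
    nn_dist = d[np.arange(len(verts_l)), nn]
    cos = np.sum(normals_l * normals_r[nn], axis=-1)  # normal agreement
    mask = (nn_dist < dist_thresh) & (cos < cos_thresh)
    if not mask.any():
        return 0.0
    # Penalize proximity of intersecting pairs; the gradient of this term
    # (w.r.t. vertex positions) is what steers each denoising step.
    return float(np.sum(gmof(dist_thresh - nn_dist[mask])))

def boxes_iou(a, b):
    """2D bounding-box IoU for the cheap pre-check: if the two hands' boxes
    barely overlap, the diffusion refinement is skipped entirely.
    Boxes are (x_min, y_min, x_max, y_max)."""
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter)
```

In a full pipeline the collision gradient would be computed through a differentiable framework (e.g. autograd on the MANO vertices) and added to each DDIM reverse step; the numpy version here only illustrates how the collision set and penalty are formed.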

Experiments. The method is evaluated on InterHand2.6M, HIC, and FreiHAND. Metrics include 3D MPJPE, 2.5D joint error, and hand‑hand penetration rate. The proposed pipeline outperforms recent state‑of‑the‑art two‑hand systems such as 4DHands, InterHandGen, and BUDDI, achieving lower joint errors and reducing penetration by more than 70 % relative to baselines. Ablation studies confirm that (i) removing the FAE and using raw priors inflates FLOPs by >2× while slightly degrading accuracy, (ii) omitting collision guidance leads to a substantial rise in penetration, and (iii) the IoU pre‑filter effectively cuts inference time without harming quality.

Discussion and Limitations. The approach inherits the bias of the underlying foundation model; rare hand poses or unusual lighting may degrade prior quality. The diffusion stage, although powerful, still requires multiple denoising steps, limiting real‑time deployment (current frame rates are below 30 fps). Moreover, the current formulation handles only hand‑hand interaction; extending to hand‑object contact or multi‑person scenarios would need additional physical constraints and possibly new priors.

Conclusion. By unifying heterogeneous 2D priors through a distilled encoder and enforcing physically plausible 3D interactions via a penetration‑free diffusion model with collision gradient guidance, the paper delivers a robust, efficient solution for occlusion‑heavy two‑hand reconstruction. The results demonstrate that careful separation of 2D and 3D alignment, coupled with generative priors, can dramatically improve both geometric accuracy and interaction realism, opening avenues for AR/VR, robotics, and character animation applications.

