PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/ .


💡 Research Summary

PoseGAM tackles the problem of unseen‑object 6‑DoF pose estimation by abandoning the traditional match‑then‑localize pipeline and instead employing an end‑to‑end multi‑view transformer that directly regresses the object pose from a query image and a set of rendered template images. The core idea is to treat the query image together with multiple template views (each rendered from the known CAD model under a known camera pose) as a single token sequence. Each view contributes visual tokens extracted by a pretrained DINOv2 encoder and a dedicated camera token generated by a lightweight camera encoder; the query image also receives a learnable camera token.
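The token layout described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the random stand-ins for DINOv2 patch features, and the helper names are all hypothetical; only the structure (one camera token per view, a learnable camera token for the query, all views concatenated into one sequence) follows the summary.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_view(visual_tokens, camera_token):
    # Prepend the per-view camera token to that view's visual tokens
    # (hypothetical layout; the paper pairs encoder patch tokens with
    # one camera token per view).
    return np.concatenate([camera_token[None, :], visual_tokens], axis=0)

D = 16           # token dimension (illustrative)
n_patches = 4    # visual tokens per view (illustrative)
n_templates = 3

# Query view: visual tokens plus a *learnable* camera token (pose unknown).
query_feats = rng.standard_normal((n_patches, D))
learnable_cam = np.zeros(D)  # stand-in for a learned embedding
views = [encode_view(query_feats, learnable_cam)]

# Template views: visual tokens plus camera tokens from known render poses.
for _ in range(n_templates):
    feats = rng.standard_normal((n_patches, D))
    cam_token = rng.standard_normal(D)  # stand-in for the camera encoder
    views.append(encode_view(feats, cam_token))

# All views form a single token sequence for the transformer.
sequence = np.concatenate(views, axis=0)
print(sequence.shape)  # (n_templates + 1) * (n_patches + 1) tokens
```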

To compensate for the fact that existing multi‑view foundation models only process RGB data, PoseGAM injects explicit geometry in two complementary ways. First, it renders depth maps for each template view, reconstructs world‑space point clouds, and passes them through a shallow convolutional network to obtain “point‑map tokens”. Second, it runs a state‑of‑the‑art point‑cloud network (e.g., PointTransformer‑v3) on the full mesh to extract per‑point feature vectors; these vectors replace the coordinate channels of the point maps, forming a view‑map that aligns with the 2‑D token format of the image encoder. Both geometry token streams are fused with the visual tokens via cross‑attention layers placed before each self‑attention block of the transformer. This design allows the model to reason jointly about appearance and 3‑D structure while preserving the pretrained image backbone’s knowledge.
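The fusion step can be illustrated with a single-head cross-attention sketch, in which visual tokens attend to a geometry token stream and receive a residual update. This is an assumption-laden toy (one head, no projections or normalization); it only conveys the idea of cross-attention placed before self-attention, not the paper's actual layer design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(visual, geometry, d):
    # Visual tokens (queries) attend to geometry tokens (keys/values);
    # residual update, as in a pre-self-attention fusion layer.
    scores = visual @ geometry.T / np.sqrt(d)
    return visual + softmax(scores) @ geometry

rng = np.random.default_rng(1)
d = 8
visual = rng.standard_normal((5, d))      # image tokens for one view
point_tokens = rng.standard_normal((5, d))  # explicit point-map stream
feat_tokens = rng.standard_normal((5, d))   # learned geometry-feature stream

# Fuse both geometry streams into the visual tokens in turn.
fused = cross_attend(visual, point_tokens, d)
fused = cross_attend(fused, feat_tokens, d)
print(fused.shape)
```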

The network predicts a camera‑to‑object transformation for each template view; the inverse of this transformation yields the desired object‑to‑camera pose for the query image. Supervision consists of an L2 loss on translation and a quaternion loss on rotation, applied only to the predicted camera poses.
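The pose inversion and the two loss terms can be written down directly. The rigid-transform inverse is standard; the quaternion loss shown here (`1 - |<q_pred, q_gt>|`, which handles the q/-q double cover) is a common choice and an assumption — the paper may use a different rotation loss.

```python
import numpy as np

def invert_pose(R, t):
    # Invert a rigid camera-to-object transform to obtain the
    # object-to-camera pose: x' = R x + t  =>  x = R^T x' - R^T t.
    R_inv = R.T
    return R_inv, -R_inv @ t

def translation_loss(t_pred, t_gt):
    # Squared L2 loss on translation.
    return float(np.sum((t_pred - t_gt) ** 2))

def quaternion_loss(q_pred, q_gt):
    # 1 - |dot| is invariant to the sign ambiguity of unit quaternions
    # (hypothetical choice; the paper's exact rotation loss may differ).
    return 1.0 - abs(float(np.dot(q_pred, q_gt)))

# Sanity check: inverting a transform twice recovers the original.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])  # 90-degree rotation about z
t = np.array([0.1, -0.2, 0.5])
R_inv, t_inv = invert_pose(R, t)
R_back, t_back = invert_pose(R_inv, t_inv)
assert np.allclose(R_back, R) and np.allclose(t_back, t)
```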

A major contribution is the construction of a massive synthetic dataset containing over 190 k high‑quality CAD models collected from public repositories (Toys4K, 3D‑FUTURE, ABO, HSSD, Objaverse). After filtering low‑quality meshes, each model is texture‑rebaked in Blender to obtain a single base‑color map, eliminating shader‑dependent inconsistencies. For each object, 50 random camera poses are generated, and for every pose the pipeline renders RGB images, depth maps, normal maps, and the corresponding point maps. Lighting, background, and noise variations are randomly sampled to mimic real‑world conditions and to reduce domain gap.
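Sampling the 50 per-object camera poses can be sketched as follows: draw camera centers uniformly on a sphere around the object and build a look-at rotation for each. The radius, conventions, and helper names are assumptions for illustration; the paper's rendering pipeline (Blender) handles this internally.

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    # Standard look-at rotation: rows are the camera's right, up, and
    # backward (-forward) axes (one common convention; others exist).
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, -forward], axis=0)

def sample_poses(n=50, radius=1.5, seed=0):
    # Uniform directions on a sphere via normalized Gaussian samples.
    rng = np.random.default_rng(seed)
    eyes = rng.standard_normal((n, 3))
    eyes *= radius / np.linalg.norm(eyes, axis=1, keepdims=True)
    return [(look_at(e), e) for e in eyes]

poses = sample_poses()
print(len(poses))
```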

Extensive experiments on standard 6‑DoF benchmarks (LINEMOD, YCB‑Video, T‑LESS, Occluded‑LINEMOD) show that PoseGAM achieves an average recall (AR) improvement of 5.1% over the previous state of the art, with gains of up to 17.6% on the most challenging datasets. Ablation studies confirm that (i) removing the geometry tokens degrades performance by 3–4%, (ii) using only point maps or only learned feature maps yields smaller gains, and (iii) combining the two is synergistic. Real‑time robot experiments demonstrate a 92% success rate in grasping tasks, with inference 1.8× faster than traditional match‑based pipelines.

Limitations include the need for pre‑rendered template images for each new object, which may hinder on‑the‑fly deployment, and residual domain gaps for highly reflective or transparent materials that are not fully captured by synthetic rendering. Future work could explore neural rendering to generate templates on demand, self‑supervised domain adaptation to bridge the synthetic‑real gap, and scaling to ultra‑dense meshes or deformable objects.

In summary, PoseGAM introduces a geometry‑aware multi‑view transformer that eliminates explicit correspondence construction, leverages large‑scale synthetic data, and delivers robust, accurate pose estimates for objects never seen during training, marking a significant step forward for robotics, AR/VR, and related applications.

