CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation
Object 6D pose estimation, a crucial task for robotics and augmented reality applications, becomes particularly challenging when dealing with novel objects whose 3D models are not readily available. To reduce dependency on 3D models, recent studies have explored one-reference-based pose estimation, which requires only a single reference view instead of a complete 3D model. However, existing methods that rely on real-valued coordinate regression suffer from limited global consistency due to the local nature of convolutional architectures and face challenges in symmetric or occluded scenarios owing to a lack of uncertainty modeling. We present CoordAR, a novel autoregressive framework for one-reference 6D pose estimation of unseen objects. CoordAR formulates 3D-3D correspondences between the reference and query views as a map of discrete tokens, which is obtained in an autoregressive and probabilistic manner. To enable accurate correspondence regression, CoordAR introduces 1) a novel coordinate map tokenization that enables probabilistic prediction over discretized 3D space; 2) a modality-decoupled encoding strategy that separately encodes RGB appearance and coordinate cues; and 3) an autoregressive transformer decoder conditioned on both position-aligned query features and the partially generated token sequence. With these novel mechanisms, CoordAR significantly outperforms existing methods on multiple benchmarks and demonstrates strong robustness to symmetry, occlusion, and other challenges in real-world tests.
💡 Research Summary
CoordAR tackles the challenging problem of estimating the 6‑DoF pose of previously unseen objects when only a single reference RGB‑D view is available. The authors reformulate dense 3‑D‑to‑3‑D correspondences between the reference and query images as a map of discrete tokens. This token map is generated autoregressively, allowing the model to capture long‑range dependencies and to express uncertainty through a categorical distribution over a quantized 3‑D space.
The pipeline consists of four stages. First, a modality‑decoupled encoder processes the reference RGB image, the reference coordinate map (ROC), and the query RGB image separately. A shared CNN extracts appearance features from both RGB inputs, while a dedicated CNN encodes the ROC map, preserving structural information. Second, a series of transformer‑style fusion blocks combine the query features with the reference ROC features via self‑attention on the query and cross‑attention confined to the same modality, thereby avoiding the domain gap between RGB and coordinate data. The output of the fusion blocks serves as position‑aligned conditioning for the decoder.
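The modality-confined cross-attention described above can be sketched in a few lines. This is a toy NumPy illustration, not the paper's implementation: the function names (`attention`, `fusion_block`) and the single-head, bias-free formulation are assumptions. The key idea it demonstrates is that attention weights are computed RGB-to-RGB (same modality), while the values gathered are the reference ROC features.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention, single head (toy version)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def fusion_block(q_rgb, r_rgb, r_roc):
    """One modality-decoupled fusion step (hypothetical sketch).

    q_rgb: (Nq, D) query-view RGB features
    r_rgb: (Nr, D) reference-view RGB features
    r_roc: (Nr, D) reference ROC (coordinate) features

    Attention weights are computed between query RGB and reference RGB
    only (same modality); the ROC features are then pulled from the
    matched reference positions, avoiding a direct RGB-coordinate
    comparison.
    """
    q = attention(q_rgb, q_rgb, q_rgb)       # self-attention on the query
    return q + attention(q, r_rgb, r_roc)    # RGB-to-RGB keys, ROC values
```

In a real network each call would carry learned projections and multiple heads; the sketch only shows where each modality enters the computation.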
Third, the coordinate map is tokenized using a pretrained VQ‑VAE. The VQ‑VAE encoder compresses the H×W ROC map into a grid of latent vectors, each of which is replaced by its nearest codebook entry, producing a sequence of discrete tokens. The autoregressive transformer decoder predicts this token sequence one token at a time, conditioned on the previously generated tokens and the fused features. The training objective is the negative log‑likelihood of the token sequence, encouraging the model to learn a probability distribution over possible token values. This probabilistic formulation naturally handles symmetric objects and occluded regions, where multiple valid correspondences exist.
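The nearest-codebook lookup at the heart of VQ tokenization is simple to state concretely. The following is a minimal NumPy sketch (the function name `quantize` and the toy dimensions are assumptions, not the paper's code): each latent vector is assigned the index of the closest codebook entry under squared Euclidean distance, and those indices are the discrete tokens the decoder predicts.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (N, D) array of encoder outputs, one per spatial cell
    codebook: (K, D) array of learned code vectors
    returns:  (N,) array of discrete token indices
    """
    # Squared Euclidean distance between every latent and every code vector.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Toy example: 4 latents near known codebook rows of an 8-entry codebook.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 16))
latents = codebook[[3, 0, 5, 3]] + 0.01 * rng.standard_normal((4, 16))
tokens = quantize(latents, codebook)  # recovers indices [3, 0, 5, 3]
```

Training the codebook itself (commitment loss, straight-through gradients) is the part the pretrained VQ-VAE supplies; at tokenization time only this lookup is needed.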
Finally, the generated token sequence is passed through the VQ‑VAE decoder to reconstruct a dense ROC map for the query view. By undoing the normalization applied to the reference point cloud and back‑projecting the query depth map, the method obtains two aligned point clouds in the reference and query camera frames. The relative pose is then computed analytically with the Umeyama algorithm, and combined with the known absolute pose of the reference view to yield the final 6‑DoF pose of the object in the query image.
Extensive experiments on LM‑OCCL, YCB‑Video, and a custom robotic manipulation dataset demonstrate that CoordAR outperforms state‑of‑the‑art one‑reference methods such as One2Any and FS6D by a large margin (7–12 % absolute improvement in ADD‑S). The gains are especially pronounced for symmetric objects and heavily occluded scenes, where previous regression‑based approaches suffer from averaging errors. Ablation studies confirm that each component—tokenization, modality‑decoupled encoding, and autoregressive decoding—contributes significantly to the overall performance.
The paper also discusses limitations: the token resolution is bounded by the patch size, which may hinder reconstruction of fine geometric details, and the need for a pretrained codebook adds an extra training step. Future work could explore multi‑scale tokenization, adaptive codebooks, or lightweight RGB‑only variants for real‑time deployment.
In summary, CoordAR introduces a novel token‑based autoregressive framework that reduces reliance on full 3‑D models, improves global consistency, and explicitly models uncertainty, setting a new benchmark for one‑reference 6‑DoF pose estimation of novel objects.