Active 6D Pose Estimation for Textureless Objects using Multi-View RGB Frames
Estimating the 6D pose of textureless objects from RGB images is an important problem in robotics. Due to appearance ambiguities, rotational symmetries, and severe occlusions, single-view 6D pose estimators are still unable to handle a wide range of objects, motivating research towards multi-view pose estimation and next-best-view prediction that address these limitations. In this work, we propose a comprehensive active perception framework for estimating the 6D poses of textureless objects using only RGB images. Our approach is built upon a key idea: decoupling the 6D pose estimation into a two-step sequential process can greatly improve both accuracy and efficiency. First, we estimate the 3D translation of each object, resolving scale and depth ambiguities inherent to RGB images. These estimates are then used to simplify the subsequent task of determining the 3D orientation, which we achieve through canonical scale template matching. Building on this formulation, we then introduce an active perception strategy that predicts the next best camera viewpoint to capture an RGB image, effectively reducing object pose uncertainty and enhancing pose accuracy. We evaluate our method on the public ROBI and TOD datasets, as well as on our reconstructed transparent object dataset, T-ROBI. Under the same camera viewpoints, our multi-view pose estimation significantly outperforms state-of-the-art approaches. Furthermore, by leveraging our next-best-view strategy, our approach achieves high pose accuracy with fewer viewpoints than heuristic-based policies across all evaluated datasets. The accompanying video and T-ROBI dataset will be released on our project page: https://trailab.github.io/ActiveODPE.
💡 Research Summary
The paper tackles the challenging problem of estimating the full 6‑degree‑of‑freedom (6D) pose of texture‑less objects using only RGB images. Recognizing that single‑view RGB methods suffer from scale, depth, and rotational‑symmetry ambiguities, the authors propose a comprehensive active perception framework that combines a two‑stage multi‑view pose estimator with an information‑theoretic next‑best‑view (NBV) planner.
In the first stage, the system aggregates multiple RGB frames with known camera extrinsics to recover the 3D translation of each object. By jointly solving a linear system that aligns 2D detections with the known 3D CAD model, the method resolves the inherent scale‑depth ambiguity that plagues monocular approaches. Robustness is enhanced through RANSAC‑style outlier rejection, yielding accurate global positions even under occlusion.
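The general mechanism can be illustrated with a toy sketch. This is not the paper's implementation; all function names and the specific formulation (DLT-style triangulation of a single object center from 2D detections, plus a two-view RANSAC loop) are illustrative assumptions about how such a multi-view translation estimate with outlier rejection might look:

```python
import numpy as np

def triangulate(proj_mats, points_2d):
    """DLT triangulation: each view contributes two rows of A @ X = 0
    derived from u = (P @ X) / (P @ X)[2]."""
    A = []
    for P, (u, v) in zip(proj_mats, points_2d):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    # Homogeneous least-squares solution: right singular vector of A.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]

def ransac_translation(proj_mats, points_2d, thresh=2.0, iters=100, seed=0):
    """Estimate a 3D object center from per-view 2D detections,
    rejecting outlier detections RANSAC-style."""
    rng = np.random.default_rng(seed)
    n = len(proj_mats)
    best_inliers = []
    for _ in range(iters):
        # Minimal sample: two views are enough to triangulate a point.
        i, j = rng.choice(n, size=2, replace=False)
        X = triangulate([proj_mats[i], proj_mats[j]],
                        [points_2d[i], points_2d[j]])
        inliers = []
        for k in range(n):
            x = proj_mats[k] @ np.append(X, 1.0)
            err = np.linalg.norm(x[:2] / x[2] - points_2d[k])
            if err < thresh:  # reprojection error in pixels
                inliers.append(k)
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Final refinement over all inlier views.
    return triangulate([proj_mats[k] for k in best_inliers],
                       [points_2d[k] for k in best_inliers])
```

The actual method additionally aligns detections with the 3D CAD model; this sketch only captures the core idea that multiple calibrated views turn the monocular scale-depth ambiguity into an overdetermined linear problem.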
The second stage fixes the estimated translation and focuses on orientation. A canonical‑scale template of the object (rendered depth and edge maps) is matched against per‑frame edge predictions produced by a dedicated network head. The matching maximizes correlation while explicitly handling object symmetries: all symmetry‑equivalent rotations are enumerated, and the one minimizing the reprojection error is selected. This decoupling simplifies the rotation search and improves convergence compared with end‑to‑end regressors.
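The symmetry handling can be illustrated in isolation. The sketch below is an assumption-laden toy, not the paper's code: it shows only why enumerating symmetry-equivalent rotations matters, using a z-axis n-fold symmetry group and a geodesic rotation distance in place of the paper's reprojection-error criterion:

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def symmetry_group_z(n):
    """Discrete n-fold rotational symmetry group about z."""
    return [rot_z(2.0 * np.pi * k / n) for k in range(n)]

def geodesic_angle(R1, R2):
    """Angle of the relative rotation R1^T @ R2, in radians."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def symmetry_aware_angle(R_est, R_gt, sym_rots):
    """Score a rotation estimate as the best match over all
    symmetry-equivalent versions of it."""
    return min(geodesic_angle(R_est @ S, R_gt) for S in sym_rots)
```

For a part with 4-fold symmetry, an estimate that is off by exactly 90° about the symmetry axis scores as a perfect match under this metric, whereas a naive comparison would report a 90° error; the paper applies the same enumeration idea with reprojection error as the selection criterion.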
To reduce the number of required views, the authors introduce an NBV strategy based on pose‑uncertainty entropy. After each observation, the current pose distribution’s entropy is computed. For each candidate camera pose, a virtual measurement is simulated, and the expected entropy reduction is estimated. The view that maximally reduces uncertainty is chosen for the next capture. This principled selection outperforms heuristic baselines (e.g., random, maximal visible area) by achieving comparable or higher accuracy with 30–40% fewer images.
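The selection rule can be sketched for a discrete pose distribution. This toy example is an assumed structure, not the paper's implementation: pose uncertainty is a categorical distribution over hypotheses, each candidate view is described by a measurement-likelihood table, and the view minimizing the expected posterior entropy is picked:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (nats)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_entropy(prior, likelihoods):
    """Expected posterior entropy after observing from one candidate view.
    likelihoods[m, h] = p(measurement m | pose hypothesis h)."""
    pm = likelihoods @ prior  # marginal p(m)
    H = 0.0
    for m, p_m in enumerate(pm):
        if p_m <= 0:
            continue
        post = likelihoods[m] * prior / p_m  # Bayes update
        H += p_m * entropy(post)
    return H

def next_best_view(prior, view_likelihoods):
    """Choose the candidate view whose simulated measurement is expected
    to leave the least pose uncertainty."""
    return int(np.argmin([expected_entropy(prior, L) for L in view_likelihoods]))
```

An uninformative view (same likelihoods for every hypothesis) leaves the entropy unchanged, while a discriminative one shrinks it, so the planner naturally prefers viewpoints that disambiguate the remaining pose hypotheses.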
The approach is evaluated on three datasets: the public ROBI dataset, the TOD dataset (which includes transparent and reflective objects), and a newly released T‑ROBI dataset that focuses on transparent parts in cluttered bins. Additionally, a large synthetic dataset derived from ROBI and T‑ROBI is used for training. Results show that, with the same number of views, the proposed multi‑view estimator surpasses state‑of‑the‑art RGB‑only methods such as CosyPose and PVNet, and it matches or exceeds depth‑based methods on reflective objects while dramatically outperforming them on transparent items. When combined with the NBV planner, the system reaches high pose accuracy using far fewer viewpoints than baseline view‑selection policies.
Key contributions are: (1) a novel two‑step 6D pose estimation pipeline that isolates translation to resolve depth ambiguity and then refines orientation via canonical template matching; (2) explicit handling of rotational symmetries and use of per‑frame edge maps to improve orientation robustness; (3) an entropy‑driven NBV algorithm that actively selects the most informative camera pose; (4) the introduction of the T‑ROBI transparent‑object dataset and a large synthetic training set; and (5) extensive real‑world experiments demonstrating superior performance on both opaque and transparent objects.
Overall, the work demonstrates that RGB‑only active perception can achieve reliable 6D pose estimation for texture‑less and even transparent objects, opening new possibilities for robotic manipulation, bin picking, and industrial automation where depth sensors are unreliable or unavailable.