DeMo-Pose: Depth-Monocular Modality Fusion for Object Pose Estimation



Rachit Agarwal⋆‡, Abhishek Joshi⋆‡, Sathish Chalasani⋆, Woo Jin Kim†
⋆Samsung R&D Institute, Bangalore    †Samsung Electronics, Suwon, Republic of Korea

ABSTRACT

Object pose estimation is a fundamental task in 3D vision with applications in robotics, AR/VR, and scene understanding. We address the challenge of category-level 9-DoF pose estimation (6D pose + 3D size) from RGB-D input, without relying on CAD models during inference. Existing depth-only methods achieve strong results but ignore semantic cues from RGB, while many RGB-D fusion models underperform due to suboptimal cross-modal fusion that fails to align semantic RGB cues with 3D geometric representations. We propose DeMo-Pose, a hybrid architecture that fuses monocular semantic features with depth-based graph convolutional representations via a novel multi-modal fusion strategy. To further improve geometric reasoning, we introduce a novel Mesh-Point Loss (MPL) that leverages mesh structure during training without adding inference overhead. Our approach achieves real-time inference and significantly improves over state-of-the-art methods across object categories, outperforming the strong GPV-Pose baseline by 3.2% on 3D IoU and 11.1% on pose accuracy on the REAL275 benchmark. The results highlight the effectiveness of depth-RGB fusion and geometry-aware learning, enabling robust category-level 3D pose estimation for real-world applications.

Index Terms: Object pose estimation, 3D vision, multi-modal fusion, point cloud, depth sensing

1. INTRODUCTION

Accurate estimation of the 9-DoF object pose (3D position, orientation, and absolute size) is a fundamental problem in computer vision. Robust pose prediction enables critical applications in robotic manipulation, autonomous navigation, and AR/VR scene understanding, where reliable 3D reasoning is essential.
3D pose estimation is widely used in AR/VR with HMDs, smartphones, and gaming. However, an inaccurate 9-DoF pose (rotation, translation, size) degrades the AR user experience and poses risks in critical tasks such as autonomous navigation, highlighting the need for robust pose estimation, as shown in Fig. 1. Pose estimation remains an active research domain in the computer vision community, with recent methods that can reliably estimate pose even under severe occlusion [1]. While significant progress has been made in 6D pose estimation, most existing approaches remain limited to instance-level settings, often requiring precise CAD models at inference. Such methods fail to generalize to category-level pose estimation, where unseen objects from known categories must be localized and scaled. Recent works have explored this problem, but depth-only approaches typically outperform RGB-D methods, indicating that current RGB-Depth fusion strategies are suboptimal.

‡ Authors with equal contributions, Samsung R&D Institute Bangalore

Fig. 1. Impact of inaccurate rotation, scale, and translation when overlaying a virtual keyboard template on a physical keyboard in MR. The bottom image shows correct alignment using pose estimation for an improved user experience, as seen through the HMD [2].

Instance-level pose estimation: Significant progress has been made in 6D instance-level pose estimation, where methods predict the pose of known objects with CAD models available. Existing approaches include direct regression [3, 4], learning latent embeddings for pose retrieval [5], correspondence-based methods with Perspective-n-Point (PnP) solvers [6], and fusion-based techniques [7]. Despite strong performance, these methods are limited in practice: they require precise CAD models and typically handle only a small set of object instances.
Category-level pose estimation: Category-level methods aim to generalize pose estimation to unseen objects from known categories, without relying on CAD models. Early works introduced canonical spaces such as NOCS [8] and its extensions [9, 10], while later approaches incorporated geometric priors or dual networks [11, 12]. Depth-only methods have shown superior accuracy compared to RGB-D fusion [13], highlighting a gap in leveraging semantic cues from RGB images. Recent self-supervised strategies [14, 15] reduce annotation cost but still lag behind supervised baselines in accuracy and efficiency.

In summary, prior work either sacrifices generalization (instance-level) or fails to exploit RGB-Depth complementarity effectively (category-level). Our work addresses this gap by fusing monocular RGB features with depth-based representations, coupled with a novel geometry-aware loss, to achieve robust and real-time category-level 9-DoF pose estimation.

Fig. 2. Fusion module architecture that leverages RGB features obtained from the monocular detection model and fuses them with depth-based GCN features. At inference, we achieve real-time performance suitable for on-device deployment. More details are provided in the proposed method, Sec. 2.

In this work, we introduce DeMo-Pose, a hybrid framework for category-level 9-DoF object pose estimation that overcomes these limitations. Our model fuses monocular RGB features with depth-based graph convolutional representations through a novel multi-modal fusion scheme. Furthermore, we propose a Mesh-Point Loss (MPL) that leverages 3D mesh structure during training to improve geometric reasoning without adding inference overhead. Our key contributions are as follows:

• A hybrid fusion architecture that integrates monocular semantic cues with depth features for pose estimation.
• A geometry-aware training objective (Mesh-Point Loss, MPL) that enhances category-level generalization.
• A real-time system (FPS ≈ 18), achieving significant improvements over state-of-the-art methods, with a 3.2% gain in 3D IoU and an 11.1% gain in pose accuracy on the REAL275 benchmark.

These contributions highlight the potential of RGB-Depth fusion and geometry-aware learning for advancing category-level 3D pose estimation in practical, real-world environments.

2. PROPOSED METHOD

We propose DeMo-Pose, a novel hybrid architecture that fuses semantic information derived from RGB input with features obtained from a depth-based pose estimation model. The overall pipeline is shown in Fig. 2. Given an RGB image as input, we propose a single-stage detector architecture that predicts projected 3D keypoints and the relative size of objects; details are provided in Section 2.1. Through end-to-end training, the network learns pose-rich semantic and object-related cues. We leverage these spatial features, as described in Section 2.2, to improve the accuracy of the depth-based pose estimation model. To further improve geometric reasoning, we introduce in Section 2.3 a Mesh-Point Loss (MPL) that leverages mesh structure during training without increasing inference cost.

2.1. Monocular Detection

Given an RGB input, our monocular method targets category-level 3D object pose estimation. Unlike approaches such as CenterPose [16], which employ separate networks per category, we adopt a generic single-stage detector scalable across object classes. Following prior works [3, 11], we predict 2D projections of 3D cuboid corners and apply a Perspective-n-Point (PnP) algorithm to recover object pose. We leverage an FCOS [17] backbone and pyramid features to predict 2D keypoints, class labels, and relative object size.
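To make the keypoint-to-pose step concrete, the sketch below builds the 8 cuboid corners implied by predicted relative dimensions and their pinhole projections. This is a minimal NumPy illustration, not the paper's implementation: the corner ordering, the intrinsics `K`, and the function names are assumptions.

```python
import numpy as np

def cuboid_corners(dims):
    """8 corners of an axis-aligned cuboid centred at the origin;
    dims = (w, h, l) stand in for the network's predicted relative dimensions."""
    w, h, l = dims
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)], dtype=float)
    return 0.5 * signs * np.array([w, h, l])  # shape (8, 3)

def project(points_3d, K, R, t):
    """Pinhole projection of 3D points under pose (R, t) and intrinsics K."""
    cam = points_3d @ R.T + t        # transform into the camera frame
    uv = cam @ K.T                   # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]    # (8, 2) projected 2D keypoints

# At inference, the detector's regressed 2D corners and cuboid_corners(dims)
# form 2D-3D correspondences that a PnP solver (e.g. cv2.solvePnP) can use
# to recover rotation and translation.
```

The detector regresses the 2D corners directly; projection is shown here only to make the 2D-3D correspondence explicit.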
Training explicitly exploits RGB semantics to learn rich pose cues by predicting relative object dimensions instead of absolute depth, thereby sidestepping the ill-posed nature of monocular depth estimation [16]; the relative dimensions serve as robust auxiliary cues for downstream fusion.

Monocular architecture: As shown in Fig. 2, the monocular network takes RGB images as input. It consists of a backbone, a Path Aggregation Network (PAN), and three monocular prediction heads. We use GhostNet [18] as the backbone, where the Ghost module efficiently generates feature maps through inexpensive linear transformations; its balance of accuracy and efficiency makes it a strong choice for embedded devices with limited resources. The multi-level backbone features are passed through the PAN, which enhances information flow via bottom-up augmentation, shortening the path between low- and high-level features.

Following the PAN, three prediction heads jointly perform: i) regression of the 8 projected 2D keypoints of a cuboid enclosing the object, ii) classification of the object, and iii) regression of its relative dimensions. Inspired by FCOS [17], these heads share stages for efficient learning. The network is trained end-to-end with GIoU loss, quality focal loss, and distribution focal loss [19], commonly used in 2D detection and adapted here for monocular 3D detection, where the estimated cuboid keypoints also act as projected 2D points.

Once the monocular model is trained, we freeze it and use the semantically rich PAN features as the Monocular Feature Map for fusion in the hybrid model. The second training phase of the hybrid architecture is detailed in the next section.

Fig. 3. (a) Comparison of predictions across video frames: GPV-Pose exhibits temporal instability for the laptop category (top row), while our DeMo-Pose fusion yields stable predictions (bottom row).
(b) Predictions are shown with green boxes and ground truth with black boxes; our method (bottom row) produces tighter and more accurate boxes for laptop and mug compared to GPV-Pose (top row).

2.2. Depth-based Backbone and Fusion

Depth information is essential for 9-DoF pose estimation. Recently, 3D graph convolution (3DGC) [20] has gained popularity due to its robustness to point cloud shift and scale. GPV-Pose [13] employs 3DGC as a backbone to extract global and local features, enabling confidence-driven closed-form 3D rotation recovery and class-specific geometric characterization. It achieves ≈ 20 FPS on standard benchmarks, making it an effective depth-based baseline for our depth-monocular fusion approach.

Following [13], we preprocess depth maps with Mask-RCNN [21] to segment objects, back-project to 3D, and sample 1028 points as GPV-Pose input. The 3DGC backbone extracts global and per-point features, which are fed to regression heads to predict the pose {r, t, s} (rotation, translation, size). However, depth-only features are sensitive to noise, occlusion, sampling, and segmentation errors. In contrast, analogous to scene-understanding tasks, RGB images provide contextual and semantic cues (e.g., occlusion and background information) useful for pose estimation. Hence, we fuse the RGB and depth modalities to exploit their complementary strengths.

As shown in Fig. 2, monocular features (H × W × C1) and depth-based features (N × C2) lie in different spaces. We align them via a feature sampling module: using the indices of the N sampled point cloud points, their spatial locations are bilinearly interpolated on the monocular feature map to form an N × C1 tensor. This makes the features dimensionally compatible with the depth-based features for fusion. While we adopt concatenation for simplicity, our approach generalizes to other fusion strategies such as addition, Hadamard multiplication, or MLP-based fusion.
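The feature sampling and concatenation steps described above can be sketched in NumPy as follows. This is an illustrative re-implementation under the stated dimensions (H × W × C1 map, N sampled points), not the authors' code; the function names are hypothetical.

```python
import numpy as np

def sample_features(feat_map, uv):
    """Bilinearly sample an (H, W, C1) monocular feature map at N (u, v)
    pixel locations, yielding an (N, C1) tensor aligned with the N points."""
    H, W, _ = feat_map.shape
    u = np.clip(uv[:, 0], 0.0, W - 1.000001)   # keep u0 + 1 inside the map
    v = np.clip(uv[:, 1], 0.0, H - 1.000001)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    # blend the four surrounding feature vectors with bilinear weights
    return (feat_map[v0, u0] * (1 - du) * (1 - dv)
            + feat_map[v0, u0 + 1] * du * (1 - dv)
            + feat_map[v0 + 1, u0] * (1 - du) * dv
            + feat_map[v0 + 1, u0 + 1] * du * dv)

def fuse(rgb_feats, depth_feats):
    """Concatenation fusion: (N, C1) and (N, C2) -> (N, C1 + C2)."""
    return np.concatenate([rgb_feats, depth_feats], axis=1)
```

The fused (N, C1 + C2) tensor is what the downstream regression heads would consume; swapping `fuse` for addition, Hadamard product, or an MLP gives the alternative strategies mentioned above.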
2.3. Geometry-Aware Mesh-Point Loss (MPL)

To further regularize pose estimation, we introduce the Mesh-Point Loss (MPL). During training, a subset of V vertices is sampled from the ground-truth object mesh using Poisson disk sampling [22]. The network predicts corresponding mesh points, and the L2 distance between predicted and ground-truth vertices is minimized:

L_{MPL} = \frac{1}{V} \sum_{i=1}^{V} \left\| R \cdot M_i^{GT} - M_i^{pred} \right\|_2,   (1)

where M_i^{GT} and M_i^{pred} are the ground-truth and predicted vertices, and R is the ground-truth rotation. Our MPL is inspired by the PoseLoss proposed in PoseCNN [23]. PoseLoss operates on a rotation matrix computed from an estimated quaternion, whereas we directly regress the 3D mesh vertices. Unlike PoseLoss, MPL can handle object symmetries since it directly supervises 3D point distributions rather than rotation parameters, thereby overcoming PoseLoss's inherent issues. Since mesh predictions are only required during training, MPL introduces no overhead at inference. The total loss combines L_{base} with MPL:

L_{total} = L_{base} + \lambda_{MPL} \cdot L_{MPL},   (2)

where \lambda_{MPL} is a hyperparameter and L_{base} denotes the standard GPV-Pose regression loss supervising rotation, translation, and scale.

The proposed architecture achieves three desirable properties: (i) effective fusion of semantic RGB and depth-based geometry, (ii) geometry-aware training via MPL, and (iii) efficient inference, as the monocular features can also be used standalone in resource-constrained scenarios.

3. EXPERIMENTS AND RESULTS

3.1. Implementation Details

To ensure fair benchmarking, we train our method purely on real data, following [5, 11, 13], and generate instance masks using Mask-RCNN [21]. Adopting the GPV-Pose setup, 1028 points are uniformly sampled from the back-projected depth map as input to the depth-based model. The multi-modal fused features are trained with the existing losses using default hyperparameters.
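A minimal NumPy sketch of the Mesh-Point Loss of Eq. (1) and the total objective of Eq. (2) is given below, assuming the Poisson-disk sampling has already produced the V ground-truth vertices; the function names are illustrative, not the training code.

```python
import numpy as np

def mesh_point_loss(mesh_gt, mesh_pred, R_gt):
    """Eq. (1): mean L2 distance between the rotated ground-truth mesh
    vertices and the vertices regressed by the network."""
    diff = mesh_gt @ R_gt.T - mesh_pred   # R * M_i^GT - M_i^pred, per vertex
    return np.linalg.norm(diff, axis=1).mean()

def total_loss(base_loss, mpl, lam_mpl=2000.0):
    """Eq. (2): base GPV-Pose regression loss plus the weighted MPL term
    (lambda_MPL = 2000 in the paper's experiments)."""
    return base_loss + lam_mpl * mpl
```

Because only `mesh_pred` comes from the network and the rotated ground-truth vertices are precomputed per sample, this term adds cost at training time only, consistent with the zero-inference-overhead claim.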
For the Mesh-Point Loss (MPL), we introduce a scalar weight λ_MPL to balance gradient magnitudes, empirically set to 2000. Training is conducted in PyTorch on the REAL275 dataset with a single model across all categories. Details of the hyperparameter selection experiments will be provided in the appendix. For consistency, we adopt the same hardware settings as reported in GPV-Pose.

3.2. Dataset and Evaluation Metrics

We evaluate DeMo-Pose on the widely used REAL275 benchmark [8], which consists of 13 challenging real-world scenes covering six object categories: bottle, bowl, camera, can, laptop, and mug. Following the standard protocol, 7 scenes (≈ 4.3k images) are used for training and 6 scenes (≈ 2.7k images) for testing. The proposed model yields superior results when benchmarked against prior pose estimation models, as shown in Table 1; a subjective comparison is presented in Fig. 3.

Following prior works [11, 13], we report: (i) 3D IoU at 25%, 50%, and 75% thresholds (3D25, 3D50, 3D75) to evaluate joint translation, rotation, and size accuracy; (ii) pose accuracy under combined rotation and translation thresholds, including 5°2cm, 5°5cm, 10°5cm, and 10°10cm; and (iii) the frame rate (FPS) of each method. Our method achieves efficient inference, following the GPV-Pose implementation settings.
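The combined rotation-translation thresholds can be checked as sketched below. This is a hedged illustration assuming translations in centimetres; the benchmark additionally gives symmetric categories (e.g., bottles and cans) special treatment around their symmetry axis, which is omitted here.

```python
import numpy as np

def pose_errors(R_pred, t_pred, R_gt, t_gt):
    """Rotation error in degrees (from the trace of the relative rotation)
    and translation error as an L2 norm (assumed here to be in cm)."""
    cos_theta = np.clip((np.trace(R_pred.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta)), float(np.linalg.norm(t_pred - t_gt))

def within(R_pred, t_pred, R_gt, t_gt, rot_deg, trans_cm):
    """True if the prediction passes a combined threshold such as 5°2cm."""
    r_err, t_err = pose_errors(R_pred, t_pred, R_gt, t_gt)
    return bool(r_err <= rot_deg and t_err <= trans_cm)
```

A metric such as 5°2cm then reports the fraction of test instances for which `within(..., 5, 2)` holds.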
Table 1. Comparison with state-of-the-art methods on the REAL275 dataset. Bold indicates superior results; ↑ denotes higher is better for each metric.

Method                   | Setting | 3D25↑ | 3D50↑ | 3D75↑ | 5°2cm↑ | 5°5cm↑ | 10°5cm↑ | 10°10cm↑ | FPS
NOCS [8] CVPR'19         | RGB-D   | 84.9  | 80.5  | 30.1  | 7.2    | 10.0   | 25.2    | 26.7     | 5
CASS [9] CVPR'20         | RGB-D   | 84.2  | 77.7  | -     | -      | 23.5   | 58.0    | 58.3     | -
SPD [5] ECCV'20          | RGB-D   | 83.4  | 77.3  | 53.2  | 19.3   | 21.4   | 54.1    | -        | 4
CR-Net [24] IROS'21      | RGB-D   | -     | 79.3  | 55.9  | 27.8   | 34.3   | 60.8    | -        | -
SGPA [11] ICCV'21        | RGB-D   | -     | 80.1  | 61.9  | 35.9   | 39.6   | 70.7    | -        | -
DualPoseNet [12] ICCV'21 | RGB-D   | -     | 79.8  | 62.2  | 29.3   | 35.9   | 66.8    | -        | 2
DO-Net [10] arXiv'21     | D       | -     | 80.4  | 63.7  | 24.1   | 34.8   | 67.4    | -        | 10
FS-Net [25] CVPR'21      | D       | -     | -     | -     | -      | 28.2   | 60.8    | 64.6     | 20
GPV-Pose [13] CVPR'22    | D       | 84.2  | 83.0  | 64.4  | 32.0   | 42.9   | 73.3    | 74.6     | 20
SSC-6D [14] AAAI'22      | RGB-D   | 83.2  | 73.0  | -     | -      | 19.6   | 54.5    | 64.4     | -
FSD [15] ICRA'24         | RGB-D   | 80.9  | 77.4  | -     | -      | 28.1   | 61.5    | 72.6     | -
DiffusionNOCS [26] IROS'24 | RGB-D | -     | -     | -     | -      | 35.0   | 66.6    | -        | -
ACR-Pose-PN2 [27] ICMR'24  | RGB-D | -     | 82.3  | 66.6  | 36.7   | 41.3   | 56.7    | 67.0     | -
GS-Pose [28] ECCV'24     | RGB-D   | 82.1  | 63.2  | -     | -      | 28.8   | -       | 60.6     | -
DeMo-Pose (Ours)         | RGB-D   | 84.2  | 83.0  | 66.8  | 34.7   | 47.7   | 79.3    | 80.6     | 17.86

Table 2. Ablation study on the efficacy of our Mesh-Point Loss (MPL) on the REAL275 test set.

Method            | 3D75 | 5°2cm | 5°5cm | 10°5cm
DeMo-Pose w/o MPL | 64.1 | 33.6  | 46.3  | 78.8
DeMo-Pose + MPL   | 66.8 | 34.7  | 47.7  | 79.3

Table 3. Ablation study on RGB-D fusion strategies.

Fusion Methods   | 3D50 | 3D75 | 5°5cm | 10°5cm | FPS
Concatenation    | 83.0 | 66.8 | 47.7  | 79.3   | 17.86
MLP + Skip       | 82.8 | 66.2 | 46.0  | 77.6   | 17.41
Attention + Skip | 82.9 | 66.9 | 46.8  | 77.9   | 17.36

3.3. Results

We compare our method with state-of-the-art models on the REAL275 dataset. Table 1 compares DeMo-Pose with representative instance- and category-level baselines. Our method achieves consistent improvements across most metrics (i.e., 5 out of 6 metrics). Notably, DeMo-Pose surpasses the strong depth-only baseline GPV-Pose by 3.2% on 3D75 IoU and 11.1% on 5°5cm pose accuracy.
Furthermore, for the 10°5cm and 10°10cm metrics, we surpass the prior art by relative increases of 8.1% and 7.4%, respectively, while running in near real time (FPS ≈ 18). These results validate the benefit of fusing semantic RGB cues with depth-based geometric features. A detailed per-category comparison on REAL275 will be provided in the appendix.

Qualitative comparisons further show that our fusion strategy produces more stable predictions than GPV-Pose. In Fig. 3(a), for a given sequence of video frames, the predictions flicker for the laptop category with GPV-Pose, while our DeMo-Pose fusion approach exhibits improved temporal consistency and stability. Furthermore, Fig. 3(b) shows that our method produces tighter and more accurate boxes for the laptop and mug categories.

Overall, the results highlight that: (i) RGB cues provide complementary semantics missing in depth-only pipelines, (ii) MPL improves geometry awareness and robustness, and (iii) the hybrid model achieves real-time inference, making it suitable for deployment in AR/VR and robotics. Next, we provide ablation studies on the novel Mesh-Point Loss and the RGB-D fusion strategies, with objective comparison results confirming the efficacy of the proposed approach.

3.4. Ablation on Mesh-Point Loss (MPL)

To understand the efficacy of the proposed Mesh-Point Loss (MPL), we analyze performance through an ablation study. In Table 2, the baseline model, i.e., the depth-monocular fusion model, is referred to as DeMo-Pose. Adding MPL consistently improves performance across all metrics, demonstrating that explicitly supervising geometry strengthens category-level pose prediction.

3.5. Ablation on RGB-D Fusion Strategies

To validate the contribution of the RGB-D fusion module, we conducted an ablation study on different fusion mechanisms.
Specifically, we evaluate three strategies: (i) concatenation, where RGB and depth features are directly concatenated; (ii) MLP-based fusion with a skip connection, where features are projected into a common space and combined through a multilayer perceptron; and (iii) attention-based fusion with a skip connection, where cross-modal attention adaptively weighs the RGB and depth features. The results, summarized in Table 3, demonstrate that simple concatenation remains the most effective strategy for RGB-D integration in the proposed framework, without introducing any additional complexity.

4. CONCLUSION

We presented DeMo-Pose, a hybrid framework for category-level 9-DoF object pose estimation that fuses semantic RGB features with depth-based geometric representations. By introducing a novel Mesh-Point Loss (MPL), our method strengthens geometry awareness during training without adding inference overhead.

Extensive experiments on the REAL275 benchmark demonstrate that DeMo-Pose achieves state-of-the-art performance, surpassing strong depth-only baselines by 3.2% on 3D IoU and 11.1% on pose accuracy, while maintaining efficient inference. These results highlight the effectiveness of multi-modal fusion and geometry-aware training for robust 3D vision.

In future work, we plan to extend DeMo-Pose towards generative frameworks and large-scale datasets for holistic 3D scene understanding, enabling broader applications in AR/VR and robotics.

5. REFERENCES

[1] Yan Di, Fabian Manhardt, Gu Wang, Xiangyang Ji, Nassir Navab, and Federico Tombari, "SO-Pose: Exploiting self-occlusion for direct 6D pose estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 12396–12405.

[2] Meta Quest Pro, https://www.meta.com/quest/quest-pro/, 2023.
[3] Wadim Kehl, Fabian Manhardt, Federico Tombari, and Slobodan Ilic, "SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1521–1529.

[4] Yinlin Hu, Pascal Fua, Wei Wang, and Mathieu Salzmann, "Single-stage 6D object pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2930–2939.

[5] Meng Tian, Marcelo H. Ang, and Gim Hee Lee, "Shape prior deformation for categorical 6D object pose and size estimation," in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI. Springer, 2020, pp. 530–546.

[6] Kiru Park, Timothy Patten, and Markus Vincze, "Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7668–7677.

[7] Yisheng He, Haibin Huang, Haoqiang Fan, Qifeng Chen, and Jian Sun, "FFB6D: A full flow bidirectional fusion network for 6D pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3003–3013.

[8] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas, "Normalized object coordinate space for category-level 6D object pose and size estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2642–2651.

[9] Dengsheng Chen, Jun Li, Zheng Wang, and Kai Xu, "Learning canonical shape space for category-level 6D object pose and size estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11973–11982.

[10] Haitao Lin, Zichang Liu, Chilam Cheang, Lingwei Zhang, Yanwei Fu, and Xiangyang Xue, "DONet: Learning category-level 6D object pose and size estimation from depth observation," arXiv preprint arXiv:2106.14193, 2021.
[11] Kai Chen and Qi Dou, "SGPA: Structure-guided prior adaptation for category-level 6D object pose estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2773–2782.

[12] Jiehong Lin, Zewei Wei, Zhihao Li, Songcen Xu, Kui Jia, and Yuanqing Li, "DualPoseNet: Category-level 6D object pose and size estimation using dual pose network with refined learning of pose consistency," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3560–3569.

[13] Yan Di et al., "GPV-Pose: Category-level object pose estimation via geometry-guided point-wise voting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6781–6791.

[14] Wanli Peng, Jianhang Yan, Hongtao Wen, and Yi Sun, "Self-supervised category-level 6D object pose estimation with deep implicit shape representation," in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 2082–2090.

[15] Mayank Lunayach, Sergey Zakharov, Dian Chen, Rares Ambrus, Zsolt Kira, and Muhammad Zubair Irshad, "FSD: Fast self-supervised single RGB-D to categorical 3D objects," in International Conference on Robotics and Automation. IEEE, 2024.

[16] Yunzhi Lin et al., "Single-stage keypoint-based category-level object pose estimation from an RGB image," in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 1547–1553.

[17] Zhi Tian et al., "FCOS: Fully convolutional one-stage object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9627–9636.

[18] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu, "GhostNet: More features from cheap operations," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1577–1586.
[19] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang, "Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection," Advances in Neural Information Processing Systems, vol. 33, pp. 21002–21012, 2020.

[20] Zhi-Hao Lin, Sheng-Yu Huang, and Yu-Chiang Frank Wang, "Convolution in the cloud: Learning deformable kernels in 3D graph convolution networks for point cloud analysis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1800–1809.

[21] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.

[22] Cem Yuksel, "Sample elimination for generating Poisson disk sample sets," in Computer Graphics Forum. Wiley Online Library, 2015, vol. 34, pp. 25–32.

[23] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox, "PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes," arXiv preprint arXiv:1711.00199, 2017.

[24] Jiaze Wang, Kai Chen, and Qi Dou, "Category-level 6D object pose estimation via cascaded relation and recurrent reconstruction networks," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2021, pp. 4807–4814.

[25] Wei Chen et al., "FS-Net: Fast shape-based network for category-level 6D object pose estimation with decoupled rotation mechanism," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1581–1590.

[26] Takuya Ikeda, Sergey Zakharov, Tianyi Ko, Muhammad Zubair Irshad, Robert Lee, Katherine Liu, Rares Ambrus, and Koichi Nishiwaki, "DiffusionNOCS: Managing symmetry and uncertainty in sim2real multi-modal category-level pose estimation," in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2024, pp. 7406–7413.
[27] Zhaoxin Fan, Zhenbo Song, Zhicheng Wang, Jian Xu, Kejian Wu, Hongyan Liu, and Jun He, "ACR-Pose: Adversarial canonical representation reconstruction network for category-level 6D object pose estimation," in Proceedings of the 2024 International Conference on Multimedia Retrieval, 2024, pp. 55–63.

[28] Pengyuan Wang, Takuya Ikeda, Robert Lee, and Koichi Nishiwaki, "GS-Pose: Category-level object pose estimation via geometric and semantic correspondence," in European Conference on Computer Vision. Springer, 2024, pp. 108–126.
