Category-Level 6D Pose Estimation Based on Deep Cross-Modal Feature Fusion
Abstract
Category-level 6D pose estimation methods aim to predict the rotation, translation, and size of unseen objects within a given category. RGB-D-based dense correspondence methods have achieved leading performance. However, because textures and shapes vary among objects within a category, the object masks produced by previous instance segmentation methods may be defective. Defective masks in turn yield inaccurate object point clouds (obtained by back-projecting the depth map) and inaccurate RGB image patches (obtained by cropping). Moreover, existing fusion methods that directly concatenate RGB and geometric features cannot obtain accurate fused features. To solve these problems, we first propose a new data processing method that improves the accuracy of the input data: the object position information provided by an object detection algorithm is fused with the image embedding extracted by a vision transformer to obtain an accurate object mask. Second, we introduce a new implicit fusion strategy that employs a cross-attention mechanism to align the two different semantic features and then reasons about the fused features of the two input modalities through a transformer-based architecture. We demonstrate the approach’s effectiveness through experiments on two publicly available datasets, REAL275 and CAMERA25.
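To make the cross-attention idea concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention between two feature sets. It is not the architecture proposed in the paper. The function name `cross_attention`, the choice of RGB features as queries and geometric features as keys and values, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: features from one modality (assumed here: per-pixel RGB features)
    keys, values: features from the other modality (assumed here: per-point
    geometric features). Learned Q/K/V projections are omitted (assumption).
    Returns one fused vector per query.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Similarity of this query to every key in the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the values gives the aligned feature for this query.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy features: 2 RGB tokens and 3 geometric tokens, dimension 2.
rgb_features = [[1.0, 0.0], [0.0, 1.0]]
geo_features = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused_features = cross_attention(rgb_features, geo_features, geo_features)
```

Unlike plain concatenation, each output vector here is a weighted mixture of the other modality's features, with weights determined by feature similarity. A full model would typically add learned projections, multiple heads, and further transformer layers on top of this step.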