Cross-Modal Fusion for Indoor RGB-D Semantic Segmentation with Transformers: Introducing CRDF


Abstract

Indoor scene understanding is crucial for intelligent robotics, and RGB-D images provide complementary depth information to enhance semantic segmentation. However, integrating multi-modal data poses challenges due to their inherent differences. In this paper, we propose CRDF, a Cross-modal Fusion framework specifically tailored for indoor scene RGB-D semantic segmentation using Transformers. CRDF introduces the Pattern-Variable Feature Rectification (PV-FR) and Pattern-Variable Feature Fusion (PV-FF) modules, which effectively extract and fuse multi-scale features from RGB and depth images. By reducing attention computations on depth images during downsampling, CRDF achieves fast convergence and robust performance. Experiments on the NYU Depth v2 and SUN-RGBD datasets demonstrate the effectiveness of CRDF, achieving state-of-the-art results with 55.51% mIoU and 67.73% MPA on NYU Depth v2, and 52.19% mIoU and 64.58% MPA on SUN-RGBD. Code is available at: https://github.com/tqqwww/CRDF.
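
The abstract does not spell out the internals of the PV-FR and PV-FF modules; a minimal PyTorch sketch of the general pattern it describes — rectifying each modality's features with the other, then fusing via cross-attention where the depth tokens are downsampled to reduce attention cost — might look like the following. The class names, gating scheme, and pooling factor here are illustrative assumptions, not CRDF's actual implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class FeatureRectification(nn.Module):
    """Toy cross-modal rectification (stand-in for PV-FR, not the paper's
    design): each stream is corrected by a channel-gated residual from
    the other, with gates derived from globally pooled statistics."""
    def __init__(self, channels):
        super().__init__()
        self.gate_rgb = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())
        self.gate_depth = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, rgb, depth):
        # Global average pooling over spatial dims -> (B, 2C) joint statistics.
        stats = torch.cat([rgb.mean(dim=(2, 3)), depth.mean(dim=(2, 3))], dim=1)
        g_rgb = self.gate_rgb(stats)[:, :, None, None]
        g_depth = self.gate_depth(stats)[:, :, None, None]
        return rgb + g_rgb * depth, depth + g_depth * rgb

class FeatureFusion(nn.Module):
    """Toy fusion (stand-in for PV-FF): RGB tokens attend to depth tokens.
    The depth map is average-pooled first, so attention runs over fewer
    depth tokens -- the kind of cost reduction the abstract alludes to."""
    def __init__(self, channels, num_heads=4, depth_pool=2):
        super().__init__()
        self.pool = nn.AvgPool2d(depth_pool)  # shrink the depth token grid
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb, depth):
        b, c, h, w = rgb.shape
        q = rgb.flatten(2).transpose(1, 2)              # (B, H*W, C) queries
        kv = self.pool(depth).flatten(2).transpose(1, 2)  # fewer depth tokens
        fused, _ = self.attn(q, kv, kv)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.proj(rgb + fused)

if __name__ == "__main__":
    rgb = torch.randn(1, 64, 32, 32)    # one scale of RGB features
    depth = torch.randn(1, 64, 32, 32)  # matching depth features
    r, d = FeatureRectification(64)(rgb, depth)
    out = FeatureFusion(64)(r, d)
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

In a multi-scale encoder, a rectify-then-fuse pair like this would typically sit at each stage, with the fused map passed to the decoder; pooling only the depth branch keeps the full-resolution RGB queries while cutting the key/value count quadratically less than pooling both.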
