Fusion-VTT: Visual-Tactile-Text Fusion Learning for Robotic Object Recognition

Abstract

Multimodal fusion is a promising approach to enhance environmental perception and object recognition for robotic systems. However, the inherent heterogeneity and semantic discrepancies among visual, tactile, and textual modalities pose significant challenges for feature fusion. This paper proposes a novel hierarchical fusion framework, Fusion-VTT, designed to achieve deep feature-level fusion across visual, tactile, and textual modalities. The framework first employs Patch Embedding and a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to extract features from the three modalities, followed by spatial alignment. It then innovatively incorporates a parameter-sharing cross-attention mechanism to establish six bidirectional interaction pathways between modality pairs, thereby capturing fine-grained cross-modal correlations. Subsequently, a self-attention module integrates these features into a globally consistent representation. To evaluate the proposed method, a new multimodal dataset comprising 20 categories of common household objects was compiled. Experimental results on this custom dataset and the public MSDO dataset demonstrate that Fusion-VTT achieves a recognition accuracy of 99.23%, substantially outperforming existing baseline methods and confirming the effectiveness of the proposed fusion strategy.
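
To make the fusion stage concrete, the following is a minimal PyTorch sketch of the idea described in the abstract: a single cross-attention block whose parameters are shared across the six directed modality pairs, followed by self-attention over the pooled tokens. The embedding dimension, head count, residual/normalization placement, and mean pooling are illustrative assumptions, not the authors' implementation; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class PairwiseCrossAttentionFusion(nn.Module):
    """Sketch: one shared cross-attention module applied to all six directed
    modality pairs (vision/touch/text), then self-attention for global fusion."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # A single attention module reused for every pathway is one possible
        # reading of "parameter-sharing cross-attention" (assumption).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, tac, txt):
        # vis, tac, txt: (batch, tokens, dim) sequences already projected
        # and spatially aligned to a common embedding dimension.
        feats = {"v": vis, "t": tac, "l": txt}
        pairs = [("v", "t"), ("t", "v"), ("v", "l"),
                 ("l", "v"), ("t", "l"), ("l", "t")]
        fused = []
        for q, kv in pairs:
            # Query one modality with another; weights shared across pathways.
            out, _ = self.cross_attn(feats[q], feats[kv], feats[kv])
            fused.append(self.norm(feats[q] + out))
        tokens = torch.cat(fused, dim=1)
        # Self-attention integrates the pairwise features into a globally
        # consistent representation; mean-pool for a classification head.
        glob, _ = self.self_attn(tokens, tokens, tokens)
        return glob.mean(dim=1)

# Usage example with dummy token sequences for the three modalities.
fusion = PairwiseCrossAttentionFusion(dim=256, heads=8)
v = torch.randn(2, 49, 256)   # visual patch tokens
t = torch.randn(2, 16, 256)   # tactile patch tokens
l = torch.randn(2, 12, 256)   # BERT text tokens
print(fusion(v, t, l).shape)  # torch.Size([2, 256])
```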
