Deep Temporal Features and Multi-Level Cross-Modal Attention Fusion for Multimodal Sentiment Analysis
Abstract
To address the challenges of insufficient multimodal feature extraction and limited cross-modal semantic diversity and interaction in multimodal sentiment analysis, this paper introduces Deep Temporal Features and Multi-Level Cross-Modal Attention Fusion (DTMCAF). First, a deep temporal feature extractor is developed: a multimodal temporal modeling network that combines bidirectional LSTMs with multi-head self-attention to capture multimodal temporal features. Next, hierarchical cross-modal attention mechanisms and feature-enhancement attention modules are designed to enable thorough information exchange between modalities. Gated fusion and multi-layer feature transformations are then employed to strengthen the fused multimodal representation. Finally, a multi-component collaborative loss function is proposed to align cross-modal features and optimize sentiment representations. Comprehensive experiments on the CMU-MOSI and CMU-MOSEI datasets show that the proposed method outperforms current state-of-the-art techniques in correlation, accuracy, and F1 score, significantly improving the precision of multimodal sentiment analysis.
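To make the architectural components named above concrete, the sketch below shows, in PyTorch, a minimal version of a BiLSTM + multi-head self-attention temporal encoder, a cross-modal attention block, and a gated fusion layer. This is not the authors' implementation: the abstract does not specify layer sizes, the number of attention levels, or the loss terms, so all class names, dimensions, and the two-modality setup here are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of the abstract's named components:
# a BiLSTM + multi-head self-attention temporal encoder per modality, a cross-modal
# attention block, and a gated fusion layer. All names and sizes are illustrative.
import torch
import torch.nn as nn


class TemporalEncoder(nn.Module):
    """Deep temporal feature extractor: BiLSTM followed by multi-head self-attention."""

    def __init__(self, in_dim: int, hidden: int = 128, heads: int = 4):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.self_attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

    def forward(self, x):                      # x: (batch, seq_len, in_dim)
        h, _ = self.bilstm(x)                  # (batch, seq_len, 2*hidden)
        out, _ = self.self_attn(h, h, h)       # self-attention over time steps
        return out


class CrossModalAttention(nn.Module):
    """One level of cross-modal attention: the target modality attends to the source."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target, source):
        enhanced, _ = self.attn(target, source, source)
        return enhanced + target               # residual feature enhancement


class GatedFusion(nn.Module):
    """Gated fusion of two modality representations."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, a, b):
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1 - g) * b


if __name__ == "__main__":
    text = torch.randn(8, 50, 300)              # e.g. text features (hypothetical dims)
    audio = torch.randn(8, 50, 74)              # e.g. audio features (hypothetical dims)
    enc_t, enc_a = TemporalEncoder(300), TemporalEncoder(74)
    t, a = enc_t(text), enc_a(audio)            # both (8, 50, 256)
    xattn = CrossModalAttention(256)
    fused = GatedFusion(256)(xattn(t, a), xattn(a, t))
    print(fused.shape)                          # torch.Size([8, 50, 256])
```

In this reading, each modality is first encoded in time, cross-modal attention exchanges information in both directions, and the gate decides how much of each enhanced stream enters the fused representation; the paper's hierarchical attention levels and multi-component loss would sit on top of such building blocks.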