Disentangled Representation Learning with Temporal Smoothness Constraints for Multimodal Sentiment Analysis

Abstract

The goal of multimodal sentiment analysis is to efficiently identify and interpret human emotions by integrating multiple modalities (e.g., text, audio, and video). Traditional representation learning techniques often fail to adequately address inter-modal heterogeneity and temporal continuity, particularly as multimodal sentiment analysis tasks grow in complexity. Consequently, these methods struggle to achieve effective cross-modal fusion while mitigating redundant information and noise interference. To address these challenges, we propose DRTSC, a novel multimodal sentiment analysis framework. First, the framework employs disentangled representation learning to separate shared and private features, introduces a temporal smoothness loss that enforces consistency across consecutive audio and video features, and incorporates an adversarial loss with backward tuning. Second, a textual hierarchical guidance module coordinates audio and video emotional expressions by leveraging affective cues from text. Finally, efficient feature fusion is achieved through cross-modal interaction layers. Extensive experiments on the CMU-MOSI and CMU-MOSEI benchmarks demonstrate that the proposed model achieves state-of-the-art performance on sentiment analysis tasks.
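As an illustrative sketch only (the abstract does not give the exact formulation, and this is not the authors' code), a temporal smoothness constraint is commonly implemented as a penalty on first-order differences between consecutive feature frames of the audio or video stream. The PyTorch example below assumes sequence features shaped (batch, time, dim); the function name and shapes are hypothetical.

```python
import torch


def temporal_smoothness_loss(features: torch.Tensor) -> torch.Tensor:
    """Penalize abrupt changes between consecutive time steps.

    features: (batch, time, dim) sequence of audio or video features.
    Returns the mean squared difference between adjacent frames.
    """
    # Difference between each frame and its successor along the time axis.
    diffs = features[:, 1:, :] - features[:, :-1, :]
    return diffs.pow(2).mean()


# Example usage on randomly generated (hypothetical) audio features:
# a batch of 8 utterances, 50 time steps, 128-dimensional features.
audio_feats = torch.randn(8, 50, 128)
loss = temporal_smoothness_loss(audio_feats)
```

A term of this kind would typically be added, with a weighting coefficient, to the main sentiment prediction loss so that the learned audio and video representations vary smoothly over time.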