Modeling Multimodal Emotion with Dynamic Interaction-Focused Representation Network

Abstract

Understanding human emotions through multimodal signals has become a pivotal task in affective computing and human-computer interaction. Among the available modalities, text and audio jointly deliver rich and complementary emotional cues. However, a key challenge lies in the temporal misalignment between these modalities, which makes it difficult to fuse them into a coherent emotional representation. In this work, we propose a novel framework named DIFERNet (Dynamic Interaction-Focused Emotion Representation Network), which directly learns robust and discriminative fused features from unaligned text and audio sequences. Unlike prior work that often relies on strict alignment or shallow fusion techniques, our method dynamically adapts to the unique characteristics of each modality while emphasizing their interdependencies. The architecture of DIFERNet comprises three main components: (1) a crossmodal dimensional alignment module that ensures feature compatibility between heterogeneous inputs; (2) an interaction-guided attention mechanism that facilitates deep crossmodal synergy for initializing the fused embeddings; and (3) a dynamic fusion adaptation transformer, which refines the fused representation in a modality-preserving manner. This final module serves as a correction mechanism that retains crucial unimodal semantics while enhancing contextual understanding across modalities. We conduct extensive evaluations on two widely used sentiment benchmarks, CMU-MOSI and CMU-MOSEI, to validate the proposed approach. Experimental results indicate that DIFERNet consistently outperforms existing baselines, showing marked improvements across all key metrics. Furthermore, qualitative analysis demonstrates its capacity to appropriately regulate sentiment predictions by leveraging nuanced acoustic features. These findings highlight the potential of DIFERNet for multimodal sentiment analysis in real-world, asynchronous environments.
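To make the three-stage design concrete, the sketch below outlines one plausible PyTorch realization of the pipeline described in the abstract: linear projections for crossmodal dimensional alignment, bidirectional cross-attention as the interaction-guided initialization of the fused embedding, and a transformer encoder with residual paths standing in for the dynamic fusion adaptation stage. All module names, feature dimensions, and layer counts here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DIFERNetSketch(nn.Module):
    """Minimal sketch of the three components described in the abstract."""

    def __init__(self, text_dim=768, audio_dim=74, d_model=128, n_heads=4):
        super().__init__()
        # (1) Crossmodal dimensional alignment: project both modalities
        #     into a shared feature space so they can attend to each other.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)

        # (2) Interaction-guided attention: cross-attention in both
        #     directions initializes the fused embedding.
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        # (3) Dynamic fusion adaptation: a transformer encoder refines the
        #     fused sequence; residual connections keep unimodal semantics.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.fusion_refiner = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.regressor = nn.Linear(d_model, 1)  # sentiment intensity score

    def forward(self, text_seq, audio_seq):
        # text_seq: (B, T_text, text_dim); audio_seq: (B, T_audio, audio_dim).
        # T_text and T_audio may differ, i.e. the sequences are unaligned.
        t = self.text_proj(text_seq)
        a = self.audio_proj(audio_seq)

        # Each modality queries the other; no temporal alignment is assumed.
        t_enriched, _ = self.text_to_audio(query=t, key=a, value=a)
        a_enriched, _ = self.audio_to_text(query=a, key=t, value=t)

        # Concatenate along time and refine, with residual paths that
        # preserve the original unimodal projections.
        fused = torch.cat([t + t_enriched, a + a_enriched], dim=1)
        refined = self.fusion_refiner(fused)
        return self.regressor(refined.mean(dim=1))  # (B, 1)


# Example: batch of 2, 20 text tokens, 50 audio frames (unaligned lengths).
model = DIFERNetSketch()
score = model(torch.randn(2, 20, 768), torch.randn(2, 50, 74))
print(score.shape)  # torch.Size([2, 1])
```

Cross-attention is used here because it handles sequences of different lengths without explicit word-level alignment, which matches the paper's stated focus on unaligned text and audio; the actual DIFERNet modules may differ in structure and detail.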
