DCA-CL: Enhancing Multimodal Emotion Recognition via Dual Cross Attention and Contrastive Learning

Abstract

Emotion is a subjective human response to external events or stimuli and plays a crucial role across various application domains. Consequently, emotion recognition has become a central focus of research. However, existing mainstream approaches still face several challenges, such as limited interaction across modalities and low recognition accuracy on limited samples of semantically similar but categorically distinct emotions. To tackle these challenges, we introduce a new multimodal emotion recognition framework, named DCA-CL (Dual Cross Attention with Contrastive Learning), which aims to improve the integration and effectiveness of cross-modal information. The proposed model incorporates a feature fusion network that combines bidirectional cross-modal attention with self-attention mechanisms, enabling effective modeling of both intra-modal and cross-modal interactions. A temporal gating mechanism filters salient features and suppresses redundant information, while dynamic weight allocation enables efficient fusion of modality-specific features. During training, a dynamic modality distillation mechanism selects the optimal teacher modality according to modality quality, guiding weaker modalities to learn high-quality semantic features and strengthening their representations. To improve recognition accuracy in few-shot settings and among semantically close emotion categories, we further incorporate a dynamic focal contrastive loss, which boosts the model's ability to learn discriminative representations. Experiments conducted on the IEMOCAP and MELD datasets confirm that the proposed DCA-CL framework delivers strong overall performance.
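For concreteness, the fusion stage described in the abstract can be sketched as follows. This is a minimal PyTorch illustration under our own assumptions (two temporally aligned modality streams, a sigmoid temporal gate, and softmax-based dynamic modality weights); the module and variable names are hypothetical and do not come from the authors' code.

    # Minimal sketch of a bidirectional cross-modal attention fusion block,
    # assuming two modality streams (e.g., text and audio) of equal length,
    # each of shape (batch, seq_len, dim). Illustrative only.
    import torch
    import torch.nn as nn

    class DualCrossAttentionBlock(nn.Module):
        def __init__(self, dim: int = 256, heads: int = 4):
            super().__init__()
            # Cross attention in both directions: text attends to audio and vice versa.
            self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
            # Self-attention to model intra-modal interactions on the fused sequence.
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            # Temporal gate: a sigmoid gate over time steps, one plausible reading
            # of "filter salient features and suppress redundant information".
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
            # Dynamic per-sample weight allocation over the two modality streams.
            self.modality_logits = nn.Linear(dim, 2)

        def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
            # Bidirectional cross-modal attention.
            t2a, _ = self.text_to_audio(text, audio, audio)   # text queries audio
            a2t, _ = self.audio_to_text(audio, text, text)    # audio queries text
            # Gate each stream using evidence from both directions
            # (assumes both streams have the same sequence length).
            g = self.gate(torch.cat([t2a, a2t], dim=-1))
            t2a, a2t = g * t2a, (1 - g) * a2t
            # Dynamic weights for fusing modality-specific features.
            pooled = (t2a + a2t).mean(dim=1)                          # (batch, dim)
            w = torch.softmax(self.modality_logits(pooled), dim=-1)   # (batch, 2)
            fused = w[:, 0, None, None] * t2a + w[:, 1, None, None] * a2t
            # Intra-modal interactions on the fused representation.
            out, _ = self.self_attn(fused, fused, fused)
            return out

    # Usage: fuse 50-step text and audio sequences into one representation.
    block = DualCrossAttentionBlock(dim=256, heads=4)
    text, audio = torch.randn(8, 50, 256), torch.randn(8, 50, 256)
    print(block(text, audio).shape)  # torch.Size([8, 50, 256])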
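The dynamic focal contrastive loss can likewise be approximated by a supervised contrastive objective with a focal modulation term (1 - p)^gamma that up-weights hard positives, i.e., semantically close emotion pairs the model still confuses. The exact formulation in the paper may differ; the temperature and gamma values below are illustrative hyperparameters.

    # Hedged sketch of a focal-modulated supervised contrastive loss.
    import torch
    import torch.nn.functional as F

    def focal_supcon_loss(feats, labels, temperature=0.1, gamma=2.0):
        """feats: (batch, dim) embeddings; labels: (batch,) integer emotion ids."""
        feats = F.normalize(feats, dim=-1)
        sim = feats @ feats.T / temperature                     # (batch, batch)
        eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
        sim = sim.masked_fill(eye, float("-inf"))               # exclude self-pairs
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        log_prob = log_prob.masked_fill(eye, 0.0)               # avoid -inf * 0 = nan
        pos = ((labels[:, None] == labels[None, :]) & ~eye).float()
        # Focal modulation: well-separated positives (high pairwise probability p)
        # are down-weighted; hard positives keep close to full weight.
        focal = (1 - log_prob.exp()).clamp(min=0.0) ** gamma
        per_anchor = -(focal * log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
        return per_anchor[pos.sum(1) > 0].mean()                # skip anchors w/o positives

    # Usage: 16 embeddings over 4 emotion classes.
    loss = focal_supcon_loss(torch.randn(16, 128), torch.randint(0, 4, (16,)))
    print(loss)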