Feature-level interaction and adaptive fusion model based on cross-modal attention for audiovisual emotion recognition

Abstract

Emotion recognition has important applications in fields such as natural language processing, computer vision, and speech recognition. However, traditional unimodal methods struggle to comprehensively capture the diversity of emotional expressions, while existing multimodal methods often focus on the textual modality and insufficiently explore feature-level correlations. To address these limitations, this paper proposes a feature-level interaction and adaptive fusion model based on cross-modal attention. Specifically, the model first extracts emotional representations from the audio and visual modalities and aligns them in a shared space. A self-attention module then performs intra-modal modeling to capture temporal dependencies within each modality. In parallel, we propose a cross-modal attention computation method based on feature-level interaction to explore fine-grained correlations and information complementarity between modalities at both the temporal and feature levels. Finally, an adaptive fusion strategy automatically learns modality weights, further enhancing modal complementarity. Experimental results demonstrate that the proposed model achieves superior performance on both the RAVDESS and IEMOCAP datasets, effectively improving the accuracy and robustness of multimodal sentiment analysis. The code is available at https://github.com/cstan-chun/MAMF/tree/master.
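The pipeline the abstract describes (cross-modal attention between modalities, followed by adaptive weighted fusion) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (see the linked repository for that): the cross-attention shown here is standard scaled dot-product attention between audio queries and visual keys/values, a simplified stand-in for the proposed feature-level interaction, and the function names, dimensions, and the scalar gating used for adaptive fusion are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_mod, key_mod, d_k):
    # query_mod: (T_q, d) features of one modality (e.g. audio);
    # key_mod: (T_k, d) features of the other modality (e.g. visual),
    # assumed already aligned in a shared space.
    scores = query_mod @ key_mod.T / np.sqrt(d_k)  # (T_q, T_k) temporal correlations
    attn = softmax(scores, axis=-1)
    return attn @ key_mod  # query tokens enriched with cross-modal context

def adaptive_fusion(feats, gate_logits):
    # feats: list of pooled (d,) modality vectors;
    # gate_logits: learnable per-modality scores (hypothetical gating scheme).
    weights = softmax(gate_logits)  # modality weights sum to 1
    return sum(w * f for w, f in zip(weights, feats))

rng = np.random.default_rng(0)
d = 8
audio = rng.standard_normal((5, d))   # 5 audio time steps
video = rng.standard_normal((7, d))   # 7 video time steps
a2v = cross_modal_attention(audio, video, d)          # audio attends to video
fused = adaptive_fusion([a2v.mean(axis=0), video.mean(axis=0)],
                        rng.standard_normal(2))
print(a2v.shape, fused.shape)  # (5, 8) (8,)
```

A symmetric video-to-audio attention pass and a classification head on `fused` would complete a toy version of such a model; the paper's feature-level variant additionally computes interactions along the feature dimension rather than only across time steps.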