Cross-Modal Temporal Attention for Robust Multimodal Emotion Recognition
Abstract
With the growing integration of intelligent systems into daily life, affective computing has become increasingly relevant. Understanding human emotional responses in complex, real-world scenarios has broad implications across domains such as human-computer interaction, entertainment, autonomous vehicles, and mental health monitoring. To this end, the CVPR 2023 Affective Behavior Analysis in-the-wild (ABAW) Competition motivates the development of robust methods for interpreting spontaneous, multimodal emotional expressions under unconstrained conditions. In this paper, we present \textbf{EMMA-Net} (Emotion-aware Multimodal Attention Network), our proposed solution to the ABAW challenge. EMMA-Net leverages heterogeneous input streams extracted from video—audio, facial visuals, and body pose trajectories—to perform temporal emotion inference. Departing from strategies that process each modality independently, we propose a temporally-informed cross-attention fusion framework that captures latent intermodal correlations and aligns their temporal dynamics, yielding more contextually grounded emotion predictions. Each stream is first processed by a modality-specific backbone encoder, and the resulting features are then selectively aggregated through a multimodal attention mechanism. The design weighs short-term temporal coherence and long-range contextual dependencies equally, so that fleeting emotional cues are interpreted within a broader affective context. On the Aff-Wild2 validation set, EMMA-Net attains a performance score of 0.418, demonstrating the strength of cross-modal attention-based fusion.
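As a rough illustration of the fusion scheme sketched above, the following PyTorch snippet shows how per-frame facial features could query audio and pose features via cross-attention, followed by a temporal self-attention pass over the fused sequence. The module name, feature dimensions, and number of emotion classes are assumptions made for illustration only and do not reflect the authors' actual implementation.

\begin{verbatim}
# Minimal sketch of cross-modal temporal attention fusion.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class CrossModalTemporalFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_classes=8):
        super().__init__()
        # The facial (visual) stream queries the audio and pose streams.
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pose_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal self-attention over the fused sequence for long-range context.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)  # assumed per-frame emotion logits

    def forward(self, face, audio, pose):
        # face, audio, pose: (batch, time, dim) features from
        # modality-specific backbone encoders.
        fused = face
        fused = fused + self.audio_attn(fused, audio, audio)[0]
        fused = fused + self.pose_attn(fused, pose, pose)[0]
        fused = self.norm(fused)
        # Long-range temporal context via self-attention over the fused stream.
        fused = fused + self.temporal_attn(fused, fused, fused)[0]
        return self.head(fused)

# Example usage with random features.
if __name__ == "__main__":
    B, T, D = 2, 64, 256
    model = CrossModalTemporalFusion(dim=D)
    logits = model(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D))
    print(logits.shape)  # torch.Size([2, 64, 8])
\end{verbatim}

The residual connections around each attention call mirror the idea of combining short-term per-frame cues with broader temporal context; the choice of the facial stream as the query is one plausible reading of the described design, not a confirmed detail.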