Cross-Modal Temporal Attention for Robust Multimodal Emotion Recognition
Abstract
With the growing integration of intelligent systems into daily life, affective computing has become increasingly relevant. Understanding human emotional responses in complex, real-world scenarios has broad implications across domains such as human-computer interaction, entertainment, autonomous vehicles, and mental health monitoring. To this end, the CVPR 2023 Affective Behavior Analysis in-the-wild (ABAW) Competition motivates the development of robust methods for interpreting spontaneous, multimodal emotional expressions under unconstrained conditions. In this paper, we present \textbf{EMMA-Net} (Emotion-aware Multimodal Attention Network), our proposed solution to the ABAW challenge. EMMA-Net leverages heterogeneous input streams extracted from video—audio, facial visuals, and body pose trajectories—to perform temporal emotion inference. Departing from strategies that process each modality independently, we propose a temporally-informed cross-attention fusion framework that captures latent intermodal correlations and aligns their temporal dynamics, yielding more contextually grounded emotion predictions. Each stream is first processed by a modality-specific backbone encoder, and the resulting features are then selectively aggregated through a multimodal attention mechanism. The design weighs short-term temporal coherence and long-range contextual dependencies equally, so that fleeting emotional cues are interpreted within a broader affective context. On the Aff-Wild2 validation set, EMMA-Net attains a performance score of 0.418, demonstrating the strength of cross-modal attention-based fusion.
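As a rough illustration of the fusion scheme sketched above, the following PyTorch snippet shows how per-frame facial features could query audio and pose features via cross-attention, followed by a temporal self-attention pass over the fused sequence. The module name, feature dimensions, and number of emotion classes are assumptions made for illustration only and do not reflect the authors' actual implementation.

\begin{verbatim}
# Minimal sketch of cross-modal temporal attention fusion.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class CrossModalTemporalFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_classes=8):
        super().__init__()
        # The facial (visual) stream queries the audio and pose streams.
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pose_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal self-attention over the fused sequence for long-range context.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)  # assumed per-frame emotion logits

    def forward(self, face, audio, pose):
        # face, audio, pose: (batch, time, dim) features from
        # modality-specific backbone encoders.
        fused = face
        fused = fused + self.audio_attn(fused, audio, audio)[0]
        fused = fused + self.pose_attn(fused, pose, pose)[0]
        fused = self.norm(fused)
        # Long-range temporal context via self-attention over the fused stream.
        fused = fused + self.temporal_attn(fused, fused, fused)[0]
        return self.head(fused)

# Example usage with random features.
if __name__ == "__main__":
    B, T, D = 2, 64, 256
    model = CrossModalTemporalFusion(dim=D)
    logits = model(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D))
    print(logits.shape)  # torch.Size([2, 64, 8])
\end{verbatim}

The residual connections around each attention call mirror the idea of combining short-term per-frame cues with broader temporal context; the choice of the facial stream as the query is one plausible reading of the described design, not a confirmed detail.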