TAC-Net: Triple Attention Contrastive Network for Speech Complex Emotion Recognition in Real-Scene
Abstract
Speech Emotion Recognition (SER) in real-world scenarios remains challenging due to uncontrolled acoustic interference and the co-occurrence of complex emotions. Existing single-label approaches often fail to capture fine-grained emotional semantics while suppressing noise, leading to significant performance degradation in complex environments. To address this, we propose a Triple Attention Contrastive Network (TAC-Net), designed to enhance model robustness through multi-dimensional feature decoupling and reconstruction. First, we introduce a Label-wise Attention Module comprising three collaborative networks: Multi-label Attention (MLA) explicitly models the latent dependencies among emotion labels; Local-Global Attention (LGA) locates key emotional frames in time while capturing long-range context; and Time-Frequency Attention (TFA) focuses on spectral energy distributions to mitigate channel distortion. Furthermore, a Contrastive Reconstruction-based Fusion Module aligns heterogeneous features in the latent space via contrastive learning, filtering redundant noise while preserving critical emotional information. Experimental results on the large-scale M³ED and CMU-MOSEI datasets demonstrate that TAC-Net not only significantly outperforms existing unimodal baselines but also achieves performance comparable to multimodal methods on multi-label and noisy data.
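The abstract names three attention branches and a contrastive fusion step but gives no implementation details. As a rough, framework-free sketch only, the core mechanisms might look like the following NumPy code: a label-wise attention pooling (one plausible form of MLA, where each emotion label attends over frames), a simple frequency-gating step standing in for TFA, and an InfoNCE-style loss as one common instantiation of contrastive alignment. All function names, shapes, and the choice of InfoNCE are our assumptions, not the paper's specification.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def label_wise_attention(frames, label_emb):
    """MLA-like pooling (assumed form): each emotion label attends over frames.
    frames: (T, d) frame features; label_emb: (K, d) -> (K, d) label-specific features."""
    scores = label_emb @ frames.T / np.sqrt(frames.shape[1])  # (K, T) scaled dot-product
    weights = softmax(scores, axis=1)                         # attention over time
    return weights @ frames                                   # (K, d)

def time_freq_attention(spec):
    """TFA-like gating (assumed form): reweight frequency bins by mean energy.
    spec: (T, F) spectral features -> same shape, frequency-gated."""
    gate = softmax(spec.mean(axis=0))          # (F,) one weight per frequency bin
    return spec * gate[None, :] * spec.shape[1]  # rescaled to preserve overall magnitude

def info_nce(za, zb, tau=0.1):
    """One common contrastive-alignment loss (InfoNCE); the paper's exact
    contrastive-reconstruction objective may differ. za, zb: (N, d) paired views."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau                                   # (N, N) similarities
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                         # positives on the diagonal

# Minimal usage on random data (shapes are illustrative only).
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 64))           # 100 frames, 64-dim acoustic features
labels = rng.normal(size=(7, 64))             # 7 hypothetical emotion labels
pooled = label_wise_attention(frames, labels)  # (7, 64) per-label representations
```

The sketch only illustrates the attention/contrastive mechanics; the actual TAC-Net presumably learns the label embeddings, gates, and projection heads end to end.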