TAC-Net: Triple Attention Contrastive Network for Speech Complex Emotion Recognition in Real-Scene

Abstract

Speech Emotion Recognition (SER) in real-world scenarios remains challenging due to uncontrolled acoustic interference and the co-occurrence of complex emotions. Existing single-label approaches often struggle to capture fine-grained emotional semantics while suppressing noise, leading to significant performance degradation in complex environments. To address this, we propose a Triple Attention Contrastive Network (TAC-Net), designed to enhance model robustness through multi-dimensional feature decoupling and reconstruction. First, we introduce a Label-wise Attention Module comprising three collaborative networks: Multi-label Attention (MLA) explicitly models the latent dependencies among emotion tags; Local-Global Attention (LGA) locates key emotional frames temporally while capturing long-range context; and Time-Frequency Attention (TFA) focuses on spectral energy distributions to mitigate channel distortion. Furthermore, a Contrastive Reconstruction-based Fusion Module aligns heterogeneous features in the latent space via contrastive learning, filtering redundant noise while preserving critical emotional information. Experimental results on the large-scale M³ED and CMU-MOSEI datasets demonstrate that TAC-Net not only significantly outperforms existing unimodal baselines but also achieves performance competitive with multimodal methods in handling multi-label and noisy data.
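The three attention branches named in the abstract can be sketched as variants of scaled dot-product attention applied along different axes. The following NumPy snippet is an illustrative sketch only, not the authors' implementation: the tensor shapes, dimensions, and the specific attention form are assumptions for illustration.

```python
# Illustrative sketch (NOT the authors' code) of the three attention
# branches described for TAC-Net. All shapes and dims are assumed.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention: q (m,d), k/v (n,d) -> (m,d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
T, F, d, L = 50, 40, 16, 4        # frames, freq bins, feature dim, labels

frames = rng.standard_normal((T, d))   # per-frame acoustic embeddings
freqs  = rng.standard_normal((F, d))   # per-frequency-bin embeddings
labels = rng.standard_normal((L, d))   # label (emotion tag) embeddings

# MLA: each emotion tag queries the frame sequence, gathering
# label-specific evidence; shared keys couple co-occurring tags.
mla_out = attend(labels, frames, frames)   # (L, d)

# LGA: self-attention over time locates key emotional frames while
# capturing long-range context (the "global" part of local-global).
lga_out = attend(frames, frames, frames)   # (T, d)

# TFA: attention over the frequency axis re-weights spectral energy,
# which can help suppress channel distortion.
tfa_out = attend(freqs, freqs, freqs)      # (F, d)

print(mla_out.shape, lga_out.shape, tfa_out.shape)
# → (4, 16) (50, 16) (40, 16)
```

In the paper these branch outputs would then be aligned and fused by the contrastive reconstruction module; here they are simply computed independently to show the axis each branch attends over.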
