TAC-Net: Triple Attention Contrastive Network for Speech Complex Emotion Recognition in Real-Scene
Abstract
Speech Emotion Recognition (SER) in real-world scenarios remains challenging due to uncontrolled acoustic interference and the co-occurrence of complex emotions. Existing single-label approaches often fail to capture fine-grained emotional semantics while suppressing noise, leading to significant performance degradation in complex environments. To address this, we propose a Triple Attention Contrastive Network (TAC-Net), designed to enhance model robustness through multi-dimensional feature decoupling and reconstruction. First, we introduce a Label-wise Attention Module comprising three collaborative networks: Multi-label Attention (MLA) explicitly models the latent dependencies among emotion labels; Local-Global Attention (LGA) locates key emotional frames in time while capturing long-range context; and Time-Frequency Attention (TFA) focuses on spectral energy distributions to mitigate channel distortion. Furthermore, a Contrastive Reconstruction-based Fusion Module aligns heterogeneous features in the latent space via contrastive learning, filtering redundant noise while preserving critical emotional information. Experimental results on the large-scale M³ED and CMU-MOSEI datasets demonstrate that TAC-Net not only significantly outperforms existing unimodal baselines but also achieves performance comparable to multimodal methods on multi-label and noisy data.
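The abstract names three attention branches and a contrastive fusion step but gives no implementation details. As a rough, framework-free sketch only, the core mechanisms might look like the following NumPy code: a label-wise attention pooling (one plausible form of MLA, where each emotion label attends over frames), a simple frequency-gating step standing in for TFA, and an InfoNCE-style loss as one common instantiation of contrastive alignment. All function names, shapes, and the choice of InfoNCE are our assumptions, not the paper's specification.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def label_wise_attention(frames, label_emb):
    """MLA-like pooling (assumed form): each emotion label attends over frames.
    frames: (T, d) frame features; label_emb: (K, d) -> (K, d) label-specific features."""
    scores = label_emb @ frames.T / np.sqrt(frames.shape[1])  # (K, T) scaled dot-product
    weights = softmax(scores, axis=1)                         # attention over time
    return weights @ frames                                   # (K, d)

def time_freq_attention(spec):
    """TFA-like gating (assumed form): reweight frequency bins by mean energy.
    spec: (T, F) spectral features -> same shape, frequency-gated."""
    gate = softmax(spec.mean(axis=0))          # (F,) one weight per frequency bin
    return spec * gate[None, :] * spec.shape[1]  # rescaled to preserve overall magnitude

def info_nce(za, zb, tau=0.1):
    """One common contrastive-alignment loss (InfoNCE); the paper's exact
    contrastive-reconstruction objective may differ. za, zb: (N, d) paired views."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau                                   # (N, N) similarities
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                         # positives on the diagonal

# Minimal usage on random data (shapes are illustrative only).
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 64))           # 100 frames, 64-dim acoustic features
labels = rng.normal(size=(7, 64))             # 7 hypothetical emotion labels
pooled = label_wise_attention(frames, labels)  # (7, 64) per-label representations
```

The sketch only illustrates the attention/contrastive mechanics; the actual TAC-Net presumably learns the label embeddings, gates, and projection heads end to end.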