Building a Gender-Bias-Resistant Super Corpus as a Deep Learning Baseline for Speech Emotion Recognition

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The focus on speech emotion recognition (SER) has dramatically increased in recent years, driven by the need for automatic speech recognition-based systems and smart assistants to enhance user experience by incorporating emotional content. While deep learning techniques have significantly advanced SER systems, their robustness concerning speaker gender and out-of-distribution data has not been thoroughly examined. Furthermore, standards for SER remain rooted in landmark papers from the 2000s, even though modern deep learning architectures can achieve comparable or superior results to the state-of-the-art of that era. In this research, we address these challenges by creating a new super corpus from existing databases, providing a larger pool of samples. We benchmark this dataset using various deep learning architectures, setting a new baseline for the task. Additionally, our experiments reveal that models trained on this super corpus not only demonstrate superior generalization and accuracy but also exhibit lower gender bias compared to models trained on individual databases. We further show that traditional pre-processing techniques, such as noise reduction and normalization, are insufficient to address inherent biases in the data. However, our data augmentation approach effectively shifts these biases, improving model fairness across gender groups and emotions and, in some cases, fully debiasing the models.

Article activity feed