Bangla Speech Emotion Recognition Using Deep Learning-Based Ensemble Learning and Feature Fusion

Md. Shahid Ahammed Shakil
Nitun Kumar Podder
S.M. Hasan Sazzad Iqbal
Abu Saleh Musa Miah
Md Abdur Rahim

Abstract

Emotion recognition in speech is essential for enhancing human-computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models and struggle with robustness and accuracy on noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition that addresses these limitations. We first apply several data augmentation techniques to the training dataset to improve the model’s robustness and generalization. We then extract a set of handcrafted features: Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. These features capture key characteristics of the speech signal that relate to its emotional content. Next, we use a multi-stream deep learning architecture to learn hierarchical representations of the speech signal. The architecture consists of three streams: the first uses a 1D Convolutional Neural Network (1D CNN), the second combines a 1D CNN with Long Short-Term Memory (LSTM), and the third combines a 1D CNN with Bidirectional LSTM (Bi-LSTM). These models capture emotional nuances that handcrafted features alone may not fully represent. Each model produces predicted scores, and we combine them through ensemble learning with soft voting to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting improves the accuracy and robustness of emotion identification across multiple datasets.
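To make two of the handcrafted features listed above concrete, the following is a minimal sketch of frame-wise ZCR and RMS energy in plain Python. It is an illustration only, not the authors' pipeline: the frame length, hop length, 16 kHz sample rate, synthetic test tone, and all function names are assumptions introduced here, and a real system would more likely rely on an audio library for these and the remaining spectral features.

```python
import math

def frame_signal(samples, frame_length, hop_length):
    """Split a list of samples into overlapping frames (no padding)."""
    last_start = len(samples) - frame_length
    return [samples[s:s + frame_length]
            for s in range(0, last_start + 1, hop_length)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Root of the mean squared amplitude of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def extract_features(samples, frame_length=400, hop_length=160):
    """Return one [zcr, rms] pair per frame (assumed 25 ms / 10 ms at 16 kHz)."""
    return [[zero_crossing_rate(f), rms_energy(f)]
            for f in frame_signal(samples, frame_length, hop_length)]

# Hypothetical input: 0.1 s of a 100 Hz sine wave sampled at 16 kHz.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 100 * n / sample_rate)
        for n in range(1600)]
features = extract_features(tone)
```

Each row of `features` describes one frame; in a multi-stream setup such as the one described, per-frame feature vectors like these would be stacked with the other descriptors before being passed to the 1D CNN streams.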
We evaluate the models on three primary datasets (SUBESCO, BanglaSER, and a merged version of both) and on two external datasets, RAVDESS and EMODB. Our method achieves accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% on the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively. These results indicate that combining handcrafted features with deep learning-based features through ensemble learning is effective for robust emotion recognition in Bangla speech and provides a more comprehensive solution than existing methods.
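The soft voting step mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say whether the three streams are weighted equally, so equal weights are assumed by default, and the probability values and the function name `soft_vote` are hypothetical.

```python
def soft_vote(prob_sets, weights=None):
    """Average per-class probabilities across models, then pick the top class.

    prob_sets: one entry per model; each entry is a list (one row per sample)
               of per-class probability lists.
    weights:   optional per-model weights; equal weighting is assumed if omitted.
    """
    if weights is None:
        weights = [1.0] * len(prob_sets)
    total = sum(weights)
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])

    averaged = []
    for i in range(n_samples):
        row = [sum(w * probs[i][c] for w, probs in zip(weights, prob_sets)) / total
               for c in range(n_classes)]
        averaged.append(row)

    # Index of the highest averaged probability for each sample.
    labels = [max(range(n_classes), key=row.__getitem__) for row in averaged]
    return averaged, labels

# Hypothetical scores from the three streams for two utterances, three classes.
cnn_probs        = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
cnn_lstm_probs   = [[0.3, 0.5, 0.2], [0.1, 0.3, 0.6]]
cnn_bilstm_probs = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]

averaged, labels = soft_vote([cnn_probs, cnn_lstm_probs, cnn_bilstm_probs])
```

Unlike hard (majority) voting, soft voting uses each model's confidence, so a stream that is strongly confident can outweigh two streams that are only marginally in favor of another class.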

Version published to 10.20944/preprints202503.1864.v1
Mar 25, 2025

Deep Learning-based Facial Expression Analysis for Video Emotion Recognition and Sentiment Prediction

This article has 2 authors:
1. Asha Priyadarshini. M
2. A. Krishna Mohan
This article has no evaluations
Latest version Mar 19, 2025
CNN in Neural Networks for Image-based Face Emotion Identification on Recognition Datasets

This article has 1 author:
1. Monalisa Hati
This article has no evaluations
Latest version Apr 15, 2025
Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems

This article has 4 authors:
1. Samuel Yaw Mensah
2. Tao Zhang
3. Nahid AI Mahmud
4. Yanzhang Geng
This article has no evaluations
Latest version Apr 14, 2025
