Emotion Recognition from Bangla Dialect Speech using Privacy-Aware Deep Learning Models: A Comparative Analysis


Abstract

Speech emotion recognition (SER) is critical for building affective, context-aware human-computer interaction systems. However, SER research in low-resource languages such as Bangla remains limited, particularly with respect to dialectal variety and privacy-preserving model training. This paper introduces a Bangla dialect-sensitive, privacy-aware SER framework that recognizes five distinct emotional states: neutral, happy, sad, angry, and surprise. We study three hybrid deep learning architectures: a composite EfficientNet-Vision Transformer (EfficientNet-ViT) model; a CNN-BiLSTM that extracts spatial-temporal patterns; and EmoDARTS, which uses differentiable architecture search for automatic optimization. With a 93.0% F1-score and 95.9% accuracy, EfficientNet-ViT outperforms the other models in a federated learning setting while keeping data secure across distributed devices. To address data scarcity and improve model generalizability, we apply a cross-lingual transfer learning technique: models are pretrained on high-resource English SER datasets (RAVDESS, SAVEE, and TESS) and then fine-tuned on Bangla datasets (SUBESCO, BanglaSER, and a newly constructed dialect-rich corpus). The proposed technique effectively addresses the challenges of dialectal diversity, resource constraints, and privacy in Bangla SER. It shows strong promise for scalable deployment in real-world applications and offers a reproducible blueprint for SER in other low-resource language settings.
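The privacy-aware training described above keeps raw speech on client devices and shares only model parameters with a central server. The abstract does not specify the aggregation rule; assuming a standard FedAvg-style weighted average (a common choice for federated learning), one aggregation round can be sketched as:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """FedAvg-style aggregation: weighted average of client parameters.

    client_weights: list of dicts mapping parameter name -> np.ndarray,
                    one dict per client (raw audio never leaves the client).
    client_sizes:   number of local training samples per client, used as
                    the averaging weight.
    """
    total = sum(client_sizes)
    return {
        name: sum(w[name] * (n / total)
                  for w, n in zip(client_weights, client_sizes))
        for name in client_weights[0]
    }

# Hypothetical round with two clients holding dialect-specific Bangla data.
clients = [
    {"w": np.array([1.0, 2.0])},  # client with 1 local sample
    {"w": np.array([3.0, 4.0])},  # client with 3 local samples
]
global_weights = federated_average(clients, client_sizes=[1, 3])
print(global_weights["w"])  # → [2.5 3.5], weighted toward the larger client
```

The server redistributes `global_weights` to all clients for the next local training round; only parameters, never speech recordings, cross the network.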
