Learning Emotional Nuances in Speech via DCNNs and Spectral Feature Integration
Abstract
Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across a variety of pattern recognition tasks in recent years, particularly in computer vision and speech analysis. Their fixed, grid-based sampling, however, restricts their capacity to model the geometric deformations and transformations found in real-world data. To overcome this constraint, this study explores the application of Deformable Convolutional Neural Networks (DCNNs). By adding learnable offsets to the sampling locations of the convolutional kernels, DCNNs improve on conventional CNNs by enabling the network to adaptively concentrate on informative regions. Using audio features such as Mel-Frequency Cepstral Coefficients (MFCCs) and Mel spectrograms, this work develops a DCNN-based model for real-time Speech Emotion Recognition (SER). The system was trained and evaluated on three widely used datasets, RAVDESS, CREMA-D, and TESS, which together cover a broad range of emotional expressions. The proposed model achieved significant gains in classification accuracy, especially in identifying subtle emotional differences between speakers. The study highlights how deformable convolutions, in contrast to traditional CNNs, offer greater flexibility and generalisation in capturing intricate patterns in speech signals. This work advances the field of affective computing by presenting a robust architecture suitable for real-time emotion-aware applications, including virtual assistants, mental health monitoring, and human-computer interaction systems.
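To make the core mechanism concrete, the sketch below illustrates the kind of deformable convolution block the abstract describes: an auxiliary convolution predicts per-location (dy, dx) offsets that shift the kernel's sampling grid, and the deformable convolution then samples the feature map at those shifted positions. This is a minimal illustration using PyTorch's torchvision.ops.DeformConv2d, not the paper's actual implementation; names such as DeformableBlock and the MFCC tensor shape are illustrative assumptions.

```python
# Minimal sketch of a deformable convolution block applied to an MFCC map,
# assuming PyTorch and torchvision are available. Illustrative only; the
# paper's architecture and hyperparameters are not specified here.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """A plain conv predicts 2*k*k offsets (dy, dx per kernel element) at
    every output location; DeformConv2d samples the input at the shifted
    grid, letting the kernel adapt to informative regions."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k,
                                     kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch,
                                        kernel_size=k, padding=k // 2)
        # Zero-initialized offsets start the block as a regular convolution.
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)

    def forward(self, x):
        offsets = self.offset_conv(x)          # (N, 2*k*k, H, W)
        return torch.relu(self.deform_conv(x, offsets))

# Toy usage: a batch of single-channel MFCC "images"
# (40 coefficients x 200 time frames; shape is a hypothetical example).
mfcc = torch.randn(8, 1, 40, 200)
block = DeformableBlock(in_ch=1, out_ch=16)
print(block(mfcc).shape)                        # torch.Size([8, 16, 40, 200])
```

Initializing the offset predictor to zero is a common design choice for deformable layers: training begins from standard grid sampling, and the network gradually learns where to deform the kernel as the offsets move away from zero.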