Learning Emotional Nuances in Speech via DCNNs and Spectral Feature Integration
Abstract
Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across a variety of pattern recognition tasks in recent years, particularly in computer vision and speech analysis. Their fixed, grid-based sampling, however, restricts their capacity to model the geometric deformations and transformations found in real-world data. To overcome this constraint, this study explores the application of Deformable Convolutional Neural Networks (DCNNs). By adding learnable offsets to the sampling locations of the convolutional kernels, DCNNs improve on conventional CNNs by enabling the network to adaptively concentrate on informative regions. Using audio features such as Mel-Frequency Cepstral Coefficients (MFCCs) and Mel spectrograms, this work develops a DCNN-based model for real-time Speech Emotion Recognition (SER). The system was trained and evaluated on three widely used datasets, RAVDESS, CREMA-D, and TESS, which together cover a broad range of emotional expressions. The proposed model achieved significant gains in classification accuracy, especially in identifying subtle emotional differences between speakers. The study highlights how deformable convolutions, in contrast to traditional CNNs, offer greater flexibility and generalisation in capturing intricate patterns in speech signals. This work advances the field of affective computing by presenting a robust architecture suitable for real-time emotion-aware applications, including virtual assistants, mental health monitoring, and human-computer interaction systems.
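To make the core mechanism concrete, the sketch below illustrates the kind of deformable convolution block the abstract describes: an auxiliary convolution predicts per-location (dy, dx) offsets that shift the kernel's sampling grid, and the deformable convolution then samples the feature map at those shifted positions. This is a minimal illustration using PyTorch's torchvision.ops.DeformConv2d, not the paper's actual implementation; names such as DeformableBlock and the MFCC tensor shape are illustrative assumptions.

```python
# Minimal sketch of a deformable convolution block applied to an MFCC map,
# assuming PyTorch and torchvision are available. Illustrative only; the
# paper's architecture and hyperparameters are not specified here.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """A plain conv predicts 2*k*k offsets (dy, dx per kernel element) at
    every output location; DeformConv2d samples the input at the shifted
    grid, letting the kernel adapt to informative regions."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k,
                                     kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch,
                                        kernel_size=k, padding=k // 2)
        # Zero-initialized offsets start the block as a regular convolution.
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)

    def forward(self, x):
        offsets = self.offset_conv(x)          # (N, 2*k*k, H, W)
        return torch.relu(self.deform_conv(x, offsets))

# Toy usage: a batch of single-channel MFCC "images"
# (40 coefficients x 200 time frames; shape is a hypothetical example).
mfcc = torch.randn(8, 1, 40, 200)
block = DeformableBlock(in_ch=1, out_ch=16)
print(block(mfcc).shape)                        # torch.Size([8, 16, 40, 200])
```

Initializing the offset predictor to zero is a common design choice for deformable layers: training begins from standard grid sampling, and the network gradually learns where to deform the kernel as the offsets move away from zero.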