Parallel Network Speech Emotion Recognition Based on Hybrid Attention Mechanism
Abstract
In speech emotion recognition, insufficient feature extraction and the limitations of single features often lead to low recognition accuracy. To address these issues, this paper proposes a parallel network structure with a hybrid attention mechanism that integrates multi-scale feature extraction and temporal modeling to enhance performance. The model maps an 81-dimensional combined feature vector to 128 dimensions via an embedding layer, enriching the representation available to subsequent layers. These features are then processed by three parallel networks, each comprising a multi-scale dilated convolution module, a bidirectional long short-term memory (BiLSTM) module, and a hybrid attention mechanism. The multi-scale dilated convolution extracts global contextual information and improves the capture of long-term dependencies, while the BiLSTM models temporal dependencies, capturing how emotion varies over time. The hybrid attention mechanism further refines feature weighting across the channel and temporal dimensions. Experiments on the RAVDESS dataset show that the proposed method achieves 96.61% accuracy and 96.52% precision on an 8-class emotion classification task, outperforming traditional convolutional neural networks, BiLSTM models, and other attention-based models. These results demonstrate the method's effectiveness in extracting and integrating speech emotion features, improving classification accuracy, and offering a novel approach to speech emotion recognition.
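To make the described architecture concrete, the following is a minimal PyTorch sketch of the pipeline the abstract outlines: an embedding from 81 to 128 dimensions, three parallel branches of multi-scale dilated convolution, BiLSTM, and hybrid (channel plus temporal) attention, followed by an 8-class classifier. The dilation rates, attention ordering, reduction ratio, pooling strategy, and fusion by concatenation are all assumptions for illustration; the paper does not specify them in the abstract.

```python
import torch
import torch.nn as nn


class HybridAttention(nn.Module):
    """Channel attention followed by temporal attention (assumed ordering)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: squeeze over time, excite per channel.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Temporal attention: one importance score per time step.
        self.temporal_fc = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())

    def forward(self, x):  # x: (batch, time, channels)
        ch_weights = self.channel_fc(x.mean(dim=1))  # (batch, channels)
        x = x * ch_weights.unsqueeze(1)              # reweight channels
        t_weights = self.temporal_fc(x)              # (batch, time, 1)
        return x * t_weights                         # reweight time steps


class Branch(nn.Module):
    """One parallel branch: multi-scale dilated conv -> BiLSTM -> attention."""

    def __init__(self, dim: int = 128, dilations=(1, 2, 4)):
        super().__init__()
        # Parallel dilated convolutions at several rates; padding keeps length.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        # Bidirectional LSTM; hidden size dim//2 so outputs are dim-wide.
        self.bilstm = nn.LSTM(dim * len(dilations), dim // 2,
                              batch_first=True, bidirectional=True)
        self.attn = HybridAttention(dim)

    def forward(self, x):                  # x: (batch, time, dim)
        c = x.transpose(1, 2)              # Conv1d expects (batch, dim, time)
        multi = torch.cat([torch.relu(conv(c)) for conv in self.convs], dim=1)
        h, _ = self.bilstm(multi.transpose(1, 2))  # (batch, time, dim)
        return self.attn(h)


class ParallelSER(nn.Module):
    """Embed 81-d frames to 128-d, run three branches, pool, and classify."""

    def __init__(self, in_dim: int = 81, dim: int = 128,
                 num_classes: int = 8, num_branches: int = 3):
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)
        self.branches = nn.ModuleList(Branch(dim) for _ in range(num_branches))
        self.classifier = nn.Linear(dim * num_branches, num_classes)

    def forward(self, x):                  # x: (batch, time, 81)
        e = self.embed(x)
        # Mean-pool each branch over time, then fuse by concatenation.
        pooled = [b(e).mean(dim=1) for b in self.branches]
        return self.classifier(torch.cat(pooled, dim=1))


if __name__ == "__main__":
    model = ParallelSER()
    dummy = torch.randn(4, 100, 81)        # 4 utterances, 100 frames each
    print(model(dummy).shape)              # torch.Size([4, 8])
```

In this sketch each branch preserves the sequence length (the dilated convolutions are padded accordingly), so the channel and temporal attention weights can be applied frame by frame before pooling; how the authors actually differentiate the three branches (e.g., by kernel size or dilation schedule) is not stated in the abstract.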