Multi-scale and Multi-feature fusion speech emotion recognition based on cross-attention

Abstract

Speech Emotion Recognition (SER), which aims to help machines understand human emotions from speech, has emerged as an integral component of Human-Computer Interaction (HCI). There are two critical challenges in the SER field. One is that rich emotional features at different scales cannot be well captured due to the restrictions of existing CNNs; the other is that the limitations of existing methods make it difficult to fuse multiple kinds of feature information effectively. This paper proposes a multi-scale and multi-feature fusion speech emotion recognition model based on cross-attention. First, according to the characteristics of the MFCC and the log-Mel spectrogram, 1D and 2D convolutions are used to extract their high-level features, respectively. Second, a residual multi-scale module is added to the convolutional neural networks to capture high-level emotional features at different scales and obtain richer fine-grained emotional features. Third, the features produced by the convolutional networks are fused with a cross-attention module, which explicitly models the fine-grained interaction between the multiple features and improves the effectiveness of multi-feature fusion. Finally, the fused features are fed to a BiLSTM to extract temporal features, and the result is passed to a fully connected classifier for emotion recognition. Experimental results on the benchmark IEMOCAP dataset show that this method improves WA and UA by 1.67% and 2.20%, respectively, compared with other methods.
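The pipeline described above (two convolutional branches for MFCC and log-Mel features, a residual multi-scale block, cross-attention fusion, then a BiLSTM classifier) can be sketched in PyTorch. Everything below is an illustrative assumption rather than the authors' exact configuration: the class names, layer sizes, the 1/3/5 kernel choice in the residual multi-scale block, the four attention heads, the four emotion classes, and the mean pooling over time are all placeholders chosen for a minimal runnable sketch.

```python
# Minimal sketch of the described architecture; all hyperparameters are assumptions.
import torch
import torch.nn as nn


class ResidualMultiScaleBlock2d(nn.Module):
    """Parallel convolutions with different kernel sizes plus a residual path
    (an assumed simplification of the paper's residual multi-scale module)."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(y)  # residual connection


class CrossAttentionFusion(nn.Module):
    """Fuse two feature sequences: each attends to the other, then concatenate."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, b):
        a2, _ = self.attn_ab(a, b, b)  # MFCC features attend to log-Mel features
        b2, _ = self.attn_ba(b, a, a)  # and vice versa
        return torch.cat([a + a2, b + b2], dim=-1)


class SERModel(nn.Module):
    def __init__(self, n_mfcc=40, n_mels=64, dim=128, n_classes=4):
        super().__init__()
        # 1D convolution over the MFCC time axis (channels = MFCC coefficients).
        self.mfcc_branch = nn.Sequential(
            nn.Conv1d(n_mfcc, dim, 5, padding=2), nn.ReLU(),
        )
        # 2D convolution over the log-Mel spectrogram (treated as an image).
        self.mel_branch = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            ResidualMultiScaleBlock2d(16),
        )
        self.mel_proj = nn.Linear(16 * n_mels, dim)
        self.fusion = CrossAttentionFusion(dim)
        self.bilstm = nn.LSTM(2 * dim, dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, mfcc, log_mel):
        # mfcc: (batch, n_mfcc, time); log_mel: (batch, 1, n_mels, time)
        a = self.mfcc_branch(mfcc).transpose(1, 2)   # (batch, time, dim)
        m = self.mel_branch(log_mel)                 # (batch, 16, n_mels, time)
        bsz, c, f, t = m.shape
        b = self.mel_proj(m.permute(0, 3, 1, 2).reshape(bsz, t, c * f))
        fused = self.fusion(a, b)                    # (batch, time, 2*dim)
        h, _ = self.bilstm(fused)                    # (batch, time, 2*dim)
        return self.classifier(h.mean(dim=1))        # pooled over time


logits = SERModel()(torch.randn(2, 40, 100), torch.randn(2, 1, 64, 100))
print(logits.shape)  # torch.Size([2, 4])
```

Both attention directions here produce a sequence the length of its query, so the two branches must share a common time resolution before fusion; a real implementation would align or downsample the two feature streams accordingly.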
