Speech Emotion Recognition Using Multiscale Global-Local Representation Learning with Feature Pyramid Network

Abstract

Speech emotion recognition (SER) is important for facilitating natural human-computer interaction. In speech sequence modeling, a key challenge is to learn context-aware sentence expression and the temporal dynamics of paralinguistic features in order to achieve unambiguous emotional semantic understanding. In previous studies, SER methods based on single-scale cascaded feature extraction modules could not effectively preserve the temporal structure of speech signals in deep layers, degrading sequence modeling performance. In this paper, we propose a novel multi-scale feature pyramid network to mitigate these limitations. With the aid of the pyramid network's bi-directional feature fusion, an emotional representation with adequate temporal semantics is obtained. Experiments on the IEMOCAP corpus demonstrate the effectiveness of the proposed method, which achieves competitive results under speaker-independent validation.
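The abstract does not detail the fusion architecture, but the core idea of a bi-directional temporal feature pyramid can be sketched in plain NumPy. The sketch below is an illustrative assumption, not the paper's implementation: all function names are hypothetical, and learned fusion layers are replaced with simple addition. A frame-level feature sequence is pooled into coarser temporal scales, a top-down pass propagates global context to fine scales, and a bottom-up pass propagates local detail to coarse scales before pooling into an utterance-level representation.

```python
import numpy as np

def temporal_pool(x, factor):
    # Average-pool along time by `factor`; x has shape [T, D].
    T, D = x.shape
    T2 = T // factor
    return x[:T2 * factor].reshape(T2, factor, D).mean(axis=1)

def temporal_upsample(x, factor):
    # Nearest-neighbour upsampling along the time axis.
    return np.repeat(x, factor, axis=0)

def bidirectional_pyramid(x, levels=3):
    # Build a multi-scale pyramid: level i has time resolution T / 2^i.
    pyr = [x]
    for _ in range(1, levels):
        pyr.append(temporal_pool(pyr[-1], 2))
    # Top-down pass: inject coarse (global) context into finer scales.
    for i in range(levels - 2, -1, -1):
        up = temporal_upsample(pyr[i + 1], 2)
        pyr[i] = pyr[i] + up[:pyr[i].shape[0]]
    # Bottom-up pass: inject fine (local) detail into coarser scales.
    for i in range(1, levels):
        pyr[i] = pyr[i] + temporal_pool(pyr[i - 1], 2)
    # Utterance-level representation: concatenate per-level temporal means.
    return np.concatenate([p.mean(axis=0) for p in pyr])

# Example: 96 frames of 40-dim log-mel features (hypothetical input).
feats = np.random.randn(96, 40).astype(np.float32)
emb = bidirectional_pyramid(feats, levels=3)
print(emb.shape)  # (120,) = 3 levels x 40 dims
```

In a trained model, the addition steps would typically be replaced by learned convolutions or attention, and the final vector would feed a classifier over the emotion categories.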
