Audio-Textual Emotion Recognition using Pre-trained models: Investigating Various Representations and Fusion Techniques
Abstract
Emotion recognition is vital in Human-Computer Interaction, enhancing artificial intelligence with emotional intelligence. Given the multimodal nature of human conversations, emotions can be detected through various modalities, leading to more accurate recognition. This makes multimodal emotion recognition a popular yet challenging research area. In this study, we focused on recognizing emotions from both audio and text modalities on the IEMOCAP dataset. We utilized transfer learning and fine-tuned transformer models for each modality, aiming to minimize the number of trainable parameters in the final system. One of the main challenges in multimodal emotion recognition lies in effectively fusing different modalities. To address this, we employed early fusion, cross-modal fusion, and late fusion techniques to integrate information from the audio and text models, using representations extracted from different layers of each model. Our results indicate that using the average of mean pooling across all layers for each modality, combined with an early fusion approach and a support vector machine (SVM) classifier, achieved the best performance. This approach resulted in an unweighted average recall (UAR) of 78.42%, a weighted average recall (WAR) of 77.75%, and a cross-entropy loss of 0.67, outperforming previous studies.
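To make the best-performing configuration described above more concrete, the following is a minimal sketch of layer-averaged mean pooling per modality, early fusion by concatenation, and an SVM classifier. The checkpoints (`bert-base-uncased`, `facebook/wav2vec2-base`), the `embed_utterance`/`train_svm` helpers, the SVM kernel, and the shape of `train_set` are illustrative assumptions, not the paper's exact models or hyperparameters.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel, Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.svm import SVC

# Assumed pre-trained checkpoints for illustration; the paper's exact models may differ.
TEXT_CKPT = "bert-base-uncased"
AUDIO_CKPT = "facebook/wav2vec2-base"

tokenizer = AutoTokenizer.from_pretrained(TEXT_CKPT)
text_model = AutoModel.from_pretrained(TEXT_CKPT, output_hidden_states=True).eval()
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(AUDIO_CKPT)
audio_model = Wav2Vec2Model.from_pretrained(AUDIO_CKPT, output_hidden_states=True).eval()


def layer_averaged_embedding(hidden_states):
    """Mean-pool each layer over the time/token axis, then average the pooled vectors across layers."""
    pooled = [h.mean(dim=1) for h in hidden_states]            # one (1, dim) vector per layer
    return torch.stack(pooled, dim=0).mean(dim=0).squeeze(0)   # average across all layers


@torch.no_grad()
def embed_utterance(waveform, sample_rate, transcript):
    # Text branch: mean pooling of every encoder layer, averaged across layers.
    text_inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    text_emb = layer_averaged_embedding(text_model(**text_inputs).hidden_states)

    # Audio branch: the same layer-wise mean pooling on the speech encoder.
    audio_inputs = feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    audio_emb = layer_averaged_embedding(audio_model(**audio_inputs).hidden_states)

    # Early fusion: concatenate the audio and text representations into one feature vector.
    return torch.cat([audio_emb, text_emb]).numpy()


def train_svm(train_set):
    # Hypothetical training loop: `train_set` is assumed to yield
    # (waveform, sample_rate, transcript, emotion_label) tuples from IEMOCAP.
    X = np.stack([embed_utterance(w, sr, t) for w, sr, t, _ in train_set])
    y = [label for *_, label in train_set]
    clf = SVC(kernel="rbf")  # kernel and hyperparameters are illustrative, not from the paper
    clf.fit(X, y)
    return clf
```

In this sketch, early fusion happens at the feature level (concatenation before classification), which is what allows a single SVM to operate on the combined representation; the cross-modal and late fusion variants mentioned above would instead combine the modalities inside a joint network or merge per-modality predictions, respectively.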