Evaluating Early, Late and Hybrid Fusion in Multimodal Emotion Detection with Pretrained Models

Abstract

Recognizing emotions in conversation is crucial for building socially aware agents, but combining cues from speech, language, and facial expressions remains challenging. This study examines how simple yet well-designed fusion strategies can improve multimodal emotion recognition on the Multimodal EmotionLines Dataset (MELD). Pretrained encoders for text, speech, and faces represent each utterance as vector embeddings, which are combined through four fusion methods: early fusion, late fusion, a hybrid average, and a lightweight meta-classifier. The proposed framework outperforms strong unimodal baselines: early fusion already surpasses a text-only model, and hybrid and meta-fusion achieve the highest accuracy and weighted F1 score, especially for high-arousal emotions such as anger, joy, and surprise. Per-class performance and confusion patterns show that hybrid and meta-fusion exploit the complementary strengths of feature-level and score-level integration while keeping the number of task-specific parameters low. These results establish the pipeline as a reliable, practical benchmark for multimodal emotion recognition in conversations and demonstrate that well-designed fusion strategies can deliver competitive performance without complex architectures.
Better emotion recognition enables more natural and engaging human-machine interaction, which matters for applications such as customer service, healthcare, and education. The findings also motivate more advanced fusion strategies that could further improve recognition accuracy and, in turn, support more capable socially aware agents.
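The four fusion strategies named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimensions, the linear classifiers, the seven-class label space, and the untrained random weights are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance embeddings from pretrained encoders;
# the dimensions below are illustrative, not taken from the paper.
text_emb = rng.normal(size=768)   # text encoder output
audio_emb = rng.normal(size=512)  # speech encoder output
face_emb = rng.normal(size=256)   # face encoder output

NUM_CLASSES = 7  # assumed emotion label set (e.g. MELD's seven classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Early fusion: concatenate features, then a single classifier head.
W_early = rng.normal(size=(NUM_CLASSES, 768 + 512 + 256)) * 0.01
early_scores = softmax(W_early @ np.concatenate([text_emb, audio_emb, face_emb]))

# Late fusion: one classifier per modality, then average the score vectors.
W_t = rng.normal(size=(NUM_CLASSES, 768)) * 0.01
W_a = rng.normal(size=(NUM_CLASSES, 512)) * 0.01
W_f = rng.normal(size=(NUM_CLASSES, 256)) * 0.01
late_scores = (softmax(W_t @ text_emb)
               + softmax(W_a @ audio_emb)
               + softmax(W_f @ face_emb)) / 3

# Hybrid fusion: average the early- and late-fusion score vectors.
hybrid_scores = (early_scores + late_scores) / 2

# Meta-fusion: a lightweight classifier stacked on the fused scores
# (shown with random weights here; in practice it would be trained
# on held-out predictions).
W_meta = rng.normal(size=(NUM_CLASSES, 2 * NUM_CLASSES)) * 0.01
meta_scores = softmax(W_meta @ np.concatenate([early_scores, late_scores]))

print("predicted class:", int(hybrid_scores.argmax()))
```

The sketch highlights why hybrid and meta-fusion stay parameter-light: beyond the frozen encoders, only small linear heads (and, for meta-fusion, a classifier over 2 × 7 score inputs) are task-specific.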
