Evaluating Early, Late and Hybrid Fusion in Multimodal Emotion Detection with Pretrained Models

Abstract

Recognizing emotions in conversation is crucial for building socially aware agents, but combining cues from speech, language, and facial expressions remains challenging. This study examines how simple yet well-designed fusion strategies can improve multimodal emotion recognition on the Multimodal EmotionLines Dataset (MELD). Pretrained encoders for text, speech, and faces represent each utterance as vector embeddings, which are combined through four fusion methods: early fusion, late fusion, a hybrid average, and a lightweight meta-classifier. The proposed framework outperforms strong unimodal baselines: early fusion already surpasses a text-only model, and hybrid and meta-fusion achieve the highest accuracy and weighted F1 score, especially for high-arousal emotions such as anger, joy, and surprise. Per-class performance and confusion patterns show that hybrid and meta-fusion exploit the complementary strengths of feature-level and score-level integration while keeping the number of task-specific parameters low. These results establish the pipeline as a reliable, practical benchmark for multimodal emotion recognition in conversations and demonstrate that well-designed fusion strategies can deliver competitive performance without complex architectures.
Better emotion recognition enables more natural and engaging human-machine interaction, which matters for applications such as customer service, healthcare, and education. The findings also motivate more advanced fusion strategies that could further improve recognition accuracy and, in turn, support more capable socially aware agents.
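The four fusion strategies named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimensions, the linear classifiers, the seven-class label space, and the untrained random weights are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance embeddings from pretrained encoders;
# the dimensions below are illustrative, not taken from the paper.
text_emb = rng.normal(size=768)   # text encoder output
audio_emb = rng.normal(size=512)  # speech encoder output
face_emb = rng.normal(size=256)   # face encoder output

NUM_CLASSES = 7  # assumed emotion label set (e.g. MELD's seven classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Early fusion: concatenate features, then a single classifier head.
W_early = rng.normal(size=(NUM_CLASSES, 768 + 512 + 256)) * 0.01
early_scores = softmax(W_early @ np.concatenate([text_emb, audio_emb, face_emb]))

# Late fusion: one classifier per modality, then average the score vectors.
W_t = rng.normal(size=(NUM_CLASSES, 768)) * 0.01
W_a = rng.normal(size=(NUM_CLASSES, 512)) * 0.01
W_f = rng.normal(size=(NUM_CLASSES, 256)) * 0.01
late_scores = (softmax(W_t @ text_emb)
               + softmax(W_a @ audio_emb)
               + softmax(W_f @ face_emb)) / 3

# Hybrid fusion: average the early- and late-fusion score vectors.
hybrid_scores = (early_scores + late_scores) / 2

# Meta-fusion: a lightweight classifier stacked on the fused scores
# (shown with random weights here; in practice it would be trained
# on held-out predictions).
W_meta = rng.normal(size=(NUM_CLASSES, 2 * NUM_CLASSES)) * 0.01
meta_scores = softmax(W_meta @ np.concatenate([early_scores, late_scores]))

print("predicted class:", int(hybrid_scores.argmax()))
```

The sketch highlights why hybrid and meta-fusion stay parameter-light: beyond the frozen encoders, only small linear heads (and, for meta-fusion, a classifier over 2 × 7 score inputs) are task-specific.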
