A Comprehensive Review in Unimodal and Multimodal Emotion Recognition

Abstract

Emotion recognition is a fundamental component of human-centered intelligent systems, supporting applications in healthcare, education, marketing, and human–computer interaction. Despite rapid progress driven by deep learning across facial, speech, textual, and multimodal settings, the literature remains difficult to compare due to inconsistent emotion models, heterogeneous datasets, and varying evaluation protocols. This survey addresses that gap by providing a unified synthesis of deep learning-based unimodal and multimodal emotion recognition within a coherent analytical framework covering emotion modeling, dataset curation, representation learning, fusion strategies, and evaluation. Rather than merely listing methods, we organize existing work around the key structural choices and trade-offs that affect generalization. For unimodal approaches, we analyze how facial, speech, and textual methods increasingly rely on self-supervised pretraining to mitigate annotation scarcity while retaining modality-specific limitations. For multimodal systems, we examine alignment, modality dominance, complementarity, robustness, and the emerging role of large language models in affective reasoning. We further highlight persistent challenges, including label ambiguity, cross-dataset generalization, fairness, and the gap between benchmark performance and real-world deployment. This survey provides a unified perspective and a roadmap for future research. Resources are available at https://github.com/jackchen69/Awesome-Emotion-Models.
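
To make the notion of "fusion strategies" mentioned above concrete, the following is a minimal, illustrative sketch (not taken from the survey) of score-level (late) fusion, in which each modality gets its own classification head and their logits are combined with learned weights. The class name, embedding dimensions, and seven-way emotion label space are hypothetical choices for illustration only.

    import torch
    import torch.nn as nn

    class LateFusionClassifier(nn.Module):
        """Late (score-level) fusion: one classifier head per modality,
        with learnable softmax-normalized fusion weights over their logits."""
        def __init__(self, dims, num_emotions):
            super().__init__()
            # One linear head per modality (e.g., face, speech, text embeddings).
            self.heads = nn.ModuleList(nn.Linear(d, num_emotions) for d in dims)
            # Learnable fusion weights, normalized with softmax at fusion time.
            self.weights = nn.Parameter(torch.zeros(len(dims)))

        def forward(self, feats):
            # feats: list of tensors, one per modality, each of shape (batch, dim_i)
            logits = torch.stack([h(x) for h, x in zip(self.heads, feats)], dim=0)
            w = torch.softmax(self.weights, dim=0).view(-1, 1, 1)
            return (w * logits).sum(dim=0)  # (batch, num_emotions)

    # Toy usage with hypothetical embedding sizes for face, speech, and text.
    model = LateFusionClassifier(dims=[512, 768, 768], num_emotions=7)
    face, speech, text = torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 768)
    print(model([face, speech, text]).shape)  # torch.Size([4, 7])

Early fusion would instead concatenate the modality features before a single classifier; the trade-off between the two (and intermediate, attention-based variants) is one of the structural choices the survey organizes the multimodal literature around.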
