A Comprehensive Review of Unimodal and Multimodal Emotion Recognition
Abstract
Emotion recognition is a fundamental component of human-centered intelligent systems, supporting applications in healthcare, education, marketing, and human–computer interaction. Despite rapid progress driven by deep learning across facial, speech, textual, and multimodal settings, the literature remains difficult to compare due to inconsistent emotion models, heterogeneous datasets, and varying evaluation protocols. This survey addresses that gap by providing a unified synthesis of deep learning-based unimodal and multimodal emotion recognition within a coherent analytical framework covering emotion modeling, dataset curation, representation learning, fusion strategies, and evaluation. Rather than merely listing methods, we organize existing work around the key structural choices and trade-offs that affect generalization. For unimodal approaches, we analyze how facial, speech, and textual methods increasingly rely on self-supervised pretraining to mitigate annotation scarcity, while continuing to face modality-specific limitations. For multimodal systems, we examine alignment, modality dominance, complementarity, robustness, and the emerging role of large language models in affective reasoning. We further highlight persistent challenges, including label ambiguity, cross-dataset generalization, fairness, and the gap between benchmark performance and real-world deployment. This survey thus offers both a unified perspective and a roadmap for future research. Resources are available at https://github.com/jackchen69/Awesome-Emotion-Models.