Emotion Recognition for Mental Health Assessment Using Transformer-Based Multimodal Learning

Abstract

With the growing reach of artificial intelligence, mental health detection tools and frameworks are increasingly adopting intelligent approaches to support early assessment and intervention through emotion detection. However, many existing systems rely on unimodal feature extraction and simple early fusion strategies, which often limits robustness and generalization in real-world mental health scenarios. This paper presents MindMed AI, a deep learning framework that employs multimodal feature extraction and transformer-based emotion detection to assist proactive mental health support. The framework combines three unimodal models: HuBERT and openSMILE for voice-based analysis, a Data-efficient Image Transformer (DeiT) for facial emotion recognition, and BERT for text-based emotion analysis. The paper presents a comparative study of the unimodal models against early and intermediate fusion. The results show that, among the individual models, acoustic evaluation achieves the highest accuracy at 91.22%, while intermediate fusion outperforms both the unimodal evaluations and early fusion with a top accuracy of 91.89%. The statistical comparison of results confirms the significance of structured multimodal integration for improving emotion recognition accuracy and robustness, highlighting the potential of transformer-based intermediate fusion for scalable and reliable mental health monitoring applications.
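
To make the intermediate-fusion idea in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes each unimodal backbone (HuBERT for audio, DeiT for the face image, BERT for text) has already produced a fixed-size embedding, and the embedding dimensions, projection sizes, and classification head shown here are illustrative assumptions.

import torch
import torch.nn as nn

class IntermediateFusionClassifier(nn.Module):
    """Illustrative intermediate fusion: project each modality embedding
    into a shared space, concatenate, then classify emotions.
    All sizes below are assumptions, not values from the paper."""

    def __init__(self, audio_dim=768, image_dim=768, text_dim=768,
                 shared_dim=256, num_emotions=7):
        super().__init__()
        # Modality-specific projections into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        # Fusion and classification head over the concatenated projections.
        self.classifier = nn.Sequential(
            nn.Linear(3 * shared_dim, shared_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(shared_dim, num_emotions),
        )

    def forward(self, audio_emb, image_emb, text_emb):
        fused = torch.cat([
            self.audio_proj(audio_emb),
            self.image_proj(image_emb),
            self.text_proj(text_emb),
        ], dim=-1)
        return self.classifier(fused)

# Usage with dummy backbone outputs for a batch of 4 samples.
model = IntermediateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 7])

In contrast, an early-fusion baseline would concatenate the raw modality features before any shared learning, whereas the sketch above fuses learned, modality-specific representations, which is the distinction the abstract credits for the accuracy gain.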
