Emotion Recognition for Mental Health Assessment Using Transformer-Based Multimodal Learning

Abstract

With the growing reach of artificial intelligence, mental health detection tools and frameworks are increasingly adopting intelligent approaches to support early assessment and intervention through emotion detection. However, many existing systems rely on unimodal feature extraction and simple early fusion strategies, which often limits robustness and generalization in real-world mental health scenarios. This paper presents MindMed AI, a deep learning framework that employs multimodal feature extraction and transformer-based emotion detection to assist proactive mental health support. The framework combines three unimodal models: HuBERT and openSMILE for voice-based analysis, a Data-efficient Image Transformer (DeiT) for facial emotion recognition, and BERT for text-based emotion analysis. The paper presents a comparative study of the unimodal models against early and intermediate fusion. The results show that, among the individual models, acoustic evaluation achieves the highest accuracy at 91.22%, while intermediate fusion outperforms both the unimodal evaluations and early fusion with a top accuracy of 91.89%. The statistical comparison of results confirms the significance of structured multimodal integration for improving emotion recognition accuracy and robustness, highlighting the potential of transformer-based intermediate fusion for scalable and reliable mental health monitoring applications.
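
To make the intermediate-fusion idea in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes each unimodal backbone (HuBERT for audio, DeiT for the face image, BERT for text) has already produced a fixed-size embedding, and the embedding dimensions, projection sizes, and classification head shown here are illustrative assumptions.

import torch
import torch.nn as nn

class IntermediateFusionClassifier(nn.Module):
    """Illustrative intermediate fusion: project each modality embedding
    into a shared space, concatenate, then classify emotions.
    All sizes below are assumptions, not values from the paper."""

    def __init__(self, audio_dim=768, image_dim=768, text_dim=768,
                 shared_dim=256, num_emotions=7):
        super().__init__()
        # Modality-specific projections into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        # Fusion and classification head over the concatenated projections.
        self.classifier = nn.Sequential(
            nn.Linear(3 * shared_dim, shared_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(shared_dim, num_emotions),
        )

    def forward(self, audio_emb, image_emb, text_emb):
        fused = torch.cat([
            self.audio_proj(audio_emb),
            self.image_proj(image_emb),
            self.text_proj(text_emb),
        ], dim=-1)
        return self.classifier(fused)

# Usage with dummy backbone outputs for a batch of 4 samples.
model = IntermediateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 7])

In contrast, an early-fusion baseline would concatenate the raw modality features before any shared learning, whereas the sketch above fuses learned, modality-specific representations, which is the distinction the abstract credits for the accuracy gain.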
