Real-Time Audio–Visual Emotion Detection for Human–AI Interaction Using a Cross-Modal Transformer


Abstract

This work aims to improve human–AI interaction in virtual learning environments through real-time, multimodal assessment based on continuous observation of learners' facial expressions and speech. A multimodal cross-modal transformer (XMT) architecture is developed that jointly processes visual features extracted from face video frames by a CNN and acoustic features extracted from speech as mel-frequency cepstral coefficients (MFCCs), projecting both into a common embedding space and applying multi-head self-attention and feedforward layers to capture temporal dependencies within each modality as well as relationships across modalities. The model was trained and evaluated on a merged corpus combining the CREMA-D and RAVDESS datasets, covering the six basic emotions of anger, disgust, fear, happiness, neutrality, and sadness, with a balanced set of 240 paired audio-visual samples spanning different speakers, recording conditions, and emotion intensities. The results show that the multimodal XMT consistently outperforms both audio-only and video-only baselines, achieving an average recognition accuracy of 73% and macro-averaged precision, recall, and F1-scores of 76%, 75%, and 75%, respectively, while maintaining an end-to-end latency below 1.5 seconds, enabling real-time interactivity on standard computing platforms. The emotion recognition framework supports an integrated virtual assistant that adapts its tone, feedback style, and instructional approach to the learner's emotional state. When a learner exhibits frustration or low confidence, the assistant offers additional support or encouragement through alternative forms of engagement; when learners are fully engaged, it presents more challenging and motivating content. Additionally, because multimodal fusion is robust to partial degradation of either modality (for instance, when the audio signal is corrupted by background noise or the video signal by poor lighting), the assistant can still infer the learner's emotional state and continue to provide stable affective sensing throughout the session. These results demonstrate that transformer-based multimodal architectures, combined with curated audio-visual corpora, provide a practical technical foundation for affect-aware personalized tutoring systems and can contribute to emotionally intelligent human–AI interfaces in education and other interactive domains.
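To make the described fusion concrete, the following is a minimal sketch of a cross-modal transformer of the kind outlined in the abstract, written in PyTorch. The layer sizes, sequence lengths, pooling strategy, and class names are illustrative assumptions rather than the authors' published hyperparameters; only the overall structure (per-frame CNN and MFCC features projected into a shared embedding space and fused with multi-head self-attention plus feedforward layers) follows the abstract.

```python
# Sketch of the cross-modal fusion described in the abstract (assumed PyTorch).
# Dimensions and hyperparameters below are illustrative, not the paper's values.
import torch
import torch.nn as nn


class CrossModalTransformer(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=40, d_model=256,
                 n_heads=8, n_layers=4, n_emotions=6):
        super().__init__()
        # Project per-frame CNN face embeddings and per-frame MFCC vectors
        # into a common embedding space.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # Learned modality embeddings so attention can distinguish the streams.
        self.modality_emb = nn.Embedding(2, d_model)
        # Multi-head self-attention + feedforward layers over the joint token
        # sequence capture within-modality and cross-modality dependencies.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, T_v, visual_dim) CNN features per video frame
        # audio_feats:  (batch, T_a, audio_dim) MFCC features per audio frame
        v = self.visual_proj(visual_feats) + self.modality_emb(
            torch.zeros(visual_feats.size(1), dtype=torch.long,
                        device=visual_feats.device))
        a = self.audio_proj(audio_feats) + self.modality_emb(
            torch.ones(audio_feats.size(1), dtype=torch.long,
                       device=audio_feats.device))
        # Concatenate both modalities into one sequence and let self-attention
        # mix information within and across modalities.
        tokens = torch.cat([v, a], dim=1)
        fused = self.encoder(tokens)
        # Mean-pool over time, then classify into the six emotion classes.
        return self.classifier(fused.mean(dim=1))


# Usage with dummy tensors: 30 video frames and 100 MFCC frames per clip.
model = CrossModalTransformer()
logits = model(torch.randn(2, 30, 512), torch.randn(2, 100, 40))
print(logits.shape)  # torch.Size([2, 6])
```

A joint-sequence encoder of this form lets each attention head attend freely across modalities, so a degraded audio or video stream can be partially compensated by the other, which is consistent with the robustness behaviour the abstract reports.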
