Real-Time Audio–Visual Emotion Detection for Human–AI Interaction Using a Cross-Modal Transformer


Abstract

This work aims to improve human–AI interaction in virtual learning environments through real-time, multimodal assessment based on continuous observation of learners' facial expressions and speech. A multimodal cross-modal transformer (XMT) architecture is developed that jointly processes visual features extracted from face video frames by a CNN and acoustic features extracted from speech as mel-frequency cepstral coefficients (MFCCs), projecting both into a common embedding space and applying multi-head self-attention and feedforward layers to capture temporal dependencies within each modality as well as relationships across modalities. The model was trained and evaluated on a merged corpus combining the CREMA-D and RAVDESS datasets, covering the six basic emotions of anger, disgust, fear, happiness, neutrality, and sadness, with a balanced set of 240 paired audio-visual samples spanning different speakers, recording conditions, and emotion intensities. The results show that the multimodal XMT consistently outperforms both audio-only and video-only baselines, achieving an average recognition accuracy of 73% and macro-averaged precision, recall, and F1-scores of 76%, 75%, and 75%, respectively, while maintaining an end-to-end latency below 1.5 seconds, enabling real-time interactivity on standard computing platforms. The emotion recognition framework supports an integrated virtual assistant that adapts its tone, feedback style, and instructional approach to the learner's emotional state. When a learner exhibits frustration or low confidence, the assistant offers additional support or encouragement through alternative forms of engagement; when learners are fully engaged, it presents more challenging and motivating content. Additionally, because multimodal fusion is robust to partial degradation of either modality (for instance, when the audio signal is corrupted by background noise or the video signal by poor lighting), the assistant can still infer the learner's emotional state and continue to provide stable affective sensing throughout the session. These results demonstrate that transformer-based multimodal architectures, combined with curated audio-visual corpora, provide a practical technical foundation for affect-aware personalized tutoring systems and can contribute to emotionally intelligent human–AI interfaces in education and other interactive domains.
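To make the described fusion concrete, the following is a minimal sketch of a cross-modal transformer of the kind outlined in the abstract, written in PyTorch. The layer sizes, sequence lengths, pooling strategy, and class names are illustrative assumptions rather than the authors' published hyperparameters; only the overall structure (per-frame CNN and MFCC features projected into a shared embedding space and fused with multi-head self-attention plus feedforward layers) follows the abstract.

```python
# Sketch of the cross-modal fusion described in the abstract (assumed PyTorch).
# Dimensions and hyperparameters below are illustrative, not the paper's values.
import torch
import torch.nn as nn


class CrossModalTransformer(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=40, d_model=256,
                 n_heads=8, n_layers=4, n_emotions=6):
        super().__init__()
        # Project per-frame CNN face embeddings and per-frame MFCC vectors
        # into a common embedding space.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # Learned modality embeddings so attention can distinguish the streams.
        self.modality_emb = nn.Embedding(2, d_model)
        # Multi-head self-attention + feedforward layers over the joint token
        # sequence capture within-modality and cross-modality dependencies.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, T_v, visual_dim) CNN features per video frame
        # audio_feats:  (batch, T_a, audio_dim) MFCC features per audio frame
        v = self.visual_proj(visual_feats) + self.modality_emb(
            torch.zeros(visual_feats.size(1), dtype=torch.long,
                        device=visual_feats.device))
        a = self.audio_proj(audio_feats) + self.modality_emb(
            torch.ones(audio_feats.size(1), dtype=torch.long,
                       device=audio_feats.device))
        # Concatenate both modalities into one sequence and let self-attention
        # mix information within and across modalities.
        tokens = torch.cat([v, a], dim=1)
        fused = self.encoder(tokens)
        # Mean-pool over time, then classify into the six emotion classes.
        return self.classifier(fused.mean(dim=1))


# Usage with dummy tensors: 30 video frames and 100 MFCC frames per clip.
model = CrossModalTransformer()
logits = model(torch.randn(2, 30, 512), torch.randn(2, 100, 40))
print(logits.shape)  # torch.Size([2, 6])
```

A joint-sequence encoder of this form lets each attention head attend freely across modalities, so a degraded audio or video stream can be partially compensated by the other, which is consistent with the robustness behaviour the abstract reports.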
