Interpretable Multimodal Emotion Recognition in Counseling Dialogues via Factor Analysis and Gaussian Mixture Modeling
Abstract
Multimodal emotion recognition, integrating facial expressions and vocal features, is key to advancing human-computer interaction and mental healthcare. However, current deep learning models often lack interpretability, limiting their real-world applicability. In this study, we present a framework for emotion recognition (happy, sad, angry, neutral) that uncovers meaningful multimodal patterns. We collected online dialogue data from 99 participants, extracting facial features (OpenFace), acoustic descriptors (openSMILE), and deep audio embeddings (VGGish). Factor analysis (FA) was applied independently to each modality for dimensionality reduction, and Gaussian mixture modeling (GMM) on the combined factor scores revealed latent multimodal expression clusters. These cluster probabilities, along with participant covariates (e.g., Big Five traits; PHQ-9; GAD-7/SCAS), served as inputs to various classifiers, including XGBoost and random forest. SHAP analysis confirmed the interpretability of the clusters, illustrating how individual differences and covariates influenced emotion predictions. We identified 15 distinct expression clusters (e.g., “social smiles,” “inexpressive states,” “wry grins”), offering nuanced insights into affective displays. Although overall accuracy was modest—due to individual variability and label noise—the framework effectively highlights interpretable, fine-grained expressive patterns. This approach lays groundwork for transparent affective computing systems, such as empathetic conversational agents, emphasizing the importance of explainability in emotion-based applications.
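The pipeline described above (per-modality factor analysis, Gaussian mixture clustering on the combined factor scores, then a classifier over cluster probabilities and participant covariates) can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the authors' implementation: all array dimensions, factor counts, and covariate names are assumptions, and a random forest stands in for the full set of classifiers.

```python
# Hedged sketch of the abstract's pipeline; all dimensions and data are
# synthetic placeholders, not the study's actual features or settings.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.mixture import GaussianMixture
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400  # synthetic stand-in for frames/utterances from the dialogue data

# Placeholder features for the three modalities named in the abstract.
face = rng.normal(size=(n, 35))      # OpenFace-style facial features
acoustic = rng.normal(size=(n, 88))  # openSMILE-style acoustic descriptors
vggish = rng.normal(size=(n, 128))   # VGGish-style deep audio embeddings

# 1) Factor analysis applied independently to each modality
#    (factor counts per modality are illustrative).
scores = [
    FactorAnalysis(n_components=k, random_state=0).fit_transform(X)
    for X, k in ((face, 5), (acoustic, 5), (vggish, 8))
]
factors = np.hstack(scores)

# 2) GMM on the combined factor scores -> soft cluster probabilities.
#    The abstract reports 15 expression clusters, so 15 is used here.
gmm = GaussianMixture(n_components=15, covariance_type="diag",
                      random_state=0).fit(factors)
cluster_probs = gmm.predict_proba(factors)

# 3) Classifier on cluster probabilities plus participant covariates
#    (hypothetical stand-ins for Big Five, PHQ-9, GAD-7/SCAS scores).
covariates = rng.normal(size=(n, 7))
X_clf = np.hstack([cluster_probs, covariates])
y = rng.integers(0, 4, size=n)  # 4 classes: happy, sad, angry, neutral
clf = RandomForestClassifier(random_state=0).fit(X_clf, y)
```

In a real application, SHAP values would then be computed on `clf` to attribute predictions to individual clusters and covariates, which is what makes the cluster-probability representation interpretable.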