Interpretable Multimodal Emotion Recognition in Counseling Dialogues via Factor Analysis and Gaussian Mixture Modeling
Abstract
Multimodal emotion recognition, integrating facial expressions and vocal features, is key to advancing human-computer interaction and mental healthcare. However, current deep learning models often lack interpretability, limiting their real-world applicability. In this study, we present a framework for emotion recognition (happy, sad, angry, neutral) that uncovers meaningful multimodal patterns. We collected online dialogue data from 99 participants, extracting facial features (OpenFace), acoustic descriptors (openSMILE), and deep audio embeddings (VGGish). Factor analysis (FA) was applied independently to each modality for dimensionality reduction, and Gaussian mixture modeling (GMM) on the combined factor scores revealed latent multimodal expression clusters. These cluster probabilities, along with participant covariates (e.g., Big Five traits; PHQ-9; GAD-7/SCAS), served as inputs to various classifiers, including XGBoost and random forest. SHAP analysis confirmed the interpretability of the clusters, illustrating how individual differences and covariates influenced emotion predictions. We identified 15 distinct expression clusters (e.g., “social smiles,” “inexpressive states,” “wry grins”), offering nuanced insights into affective displays. Although overall accuracy was modest—due to individual variability and label noise—the framework effectively highlights interpretable, fine-grained expressive patterns. This approach lays groundwork for transparent affective computing systems, such as empathetic conversational agents, emphasizing the importance of explainability in emotion-based applications.
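The pipeline described above (per-modality factor analysis, Gaussian mixture clustering on the combined factor scores, then a classifier over cluster probabilities and participant covariates) can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the authors' implementation: all array dimensions, factor counts, and covariate names are assumptions, and a random forest stands in for the full set of classifiers.

```python
# Hedged sketch of the abstract's pipeline; all dimensions and data are
# synthetic placeholders, not the study's actual features or settings.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.mixture import GaussianMixture
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400  # synthetic stand-in for frames/utterances from the dialogue data

# Placeholder features for the three modalities named in the abstract.
face = rng.normal(size=(n, 35))      # OpenFace-style facial features
acoustic = rng.normal(size=(n, 88))  # openSMILE-style acoustic descriptors
vggish = rng.normal(size=(n, 128))   # VGGish-style deep audio embeddings

# 1) Factor analysis applied independently to each modality
#    (factor counts per modality are illustrative).
scores = [
    FactorAnalysis(n_components=k, random_state=0).fit_transform(X)
    for X, k in ((face, 5), (acoustic, 5), (vggish, 8))
]
factors = np.hstack(scores)

# 2) GMM on the combined factor scores -> soft cluster probabilities.
#    The abstract reports 15 expression clusters, so 15 is used here.
gmm = GaussianMixture(n_components=15, covariance_type="diag",
                      random_state=0).fit(factors)
cluster_probs = gmm.predict_proba(factors)

# 3) Classifier on cluster probabilities plus participant covariates
#    (hypothetical stand-ins for Big Five, PHQ-9, GAD-7/SCAS scores).
covariates = rng.normal(size=(n, 7))
X_clf = np.hstack([cluster_probs, covariates])
y = rng.integers(0, 4, size=n)  # 4 classes: happy, sad, angry, neutral
clf = RandomForestClassifier(random_state=0).fit(X_clf, y)
```

In a real application, SHAP values would then be computed on `clf` to attribute predictions to individual clusters and covariates, which is what makes the cluster-probability representation interpretable.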