From Patient Emotion Recognition to Provider Understanding: A Multimodal Data Mining Framework for Clinical Counseling Communication Analysis


Abstract

The computational analysis of therapeutic communication presents fundamental challenges in multi-label classification, severe class imbalance, and heterogeneous multimodal data integration. We introduce a comprehensive bidirectional framework that addresses patient emotion recognition and provider behavior analysis through advanced data mining techniques. For patient-side emotion recognition, we employ ClinicalBERT fine-tuned on the human-annotated CounselChat dataset, comprising 1,482 counseling interactions across 25 emotion categories with class-imbalance ratios reaching 60:1. Through frequency-stratified class weighting combined with dynamic per-class threshold optimization, we achieve a macro-F1 of 0.74, a six-fold improvement over baseline multi-label approaches. Because patient-side emotion detection alone offers limited analytic utility, we extend our framework to provider-side behavior recognition using real-world psychotherapy sessions. We process 330 YouTube therapy sessions through an automated pipeline incorporating speaker diarization, automatic speech recognition, and temporal segmentation, yielding 14,086 annotated 10-second communication segments. Our provider-side architecture combines DeBERTa-v3-base for contextual text encoding with WavLM-base-plus for self-supervised audio representation learning, integrated through cross-modal attention mechanisms that learn content-dependent prosodic associations. On the controlled, human-annotated HOPE corpus, comprising 178 sessions with approximately 12,500 utterances, the provider model achieves a macro-F1 of 0.91 and a Cohen's kappa of 0.87, comparable to the inter-rater reliability reported for trained human annotators in psychotherapy process research, and outperforms simple concatenation-based fusion by 12 percentage points.
On automatically annotated YouTube data, the model achieves a macro-F1 of 0.71, demonstrating the feasibility of analyzing naturalistic clinical communication at scale while highlighting the performance gap between controlled and real-world settings.
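The cross-modal attention fusion mentioned above, in which text tokens query audio frames to learn content-dependent prosodic associations, can be sketched with a single attention head. The projection matrices and single-head form are illustrative assumptions; the abstract does not specify the fusion head's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_states, audio_states, W_q, W_k, W_v):
    """Single-head cross-modal attention: text tokens query audio frames.

    text_states:  (T_text, d)  contextual text encodings (e.g. DeBERTa output)
    audio_states: (T_audio, d) self-supervised audio frames (e.g. WavLM output)
    Returns (T_text, d_v): each text token's prosody-informed audio summary.
    """
    Q = text_states @ W_q                     # (T_text, d_k) queries from text
    K = audio_states @ W_k                    # (T_audio, d_k) keys from audio
    V = audio_states @ W_v                    # (T_audio, d_v) values from audio
    scores = Q @ K.T / np.sqrt(W_q.shape[1])  # scaled dot-product similarity
    attn = softmax(scores, axis=-1)           # per token: weights over frames
    return attn @ V                           # convex combination of frames
```

Because each text token computes its own distribution over audio frames, the fused representation can associate, say, a hedging phrase with the rising intonation that accompanies it, which concatenation-based fusion cannot express.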