Classification of Communication and Head Movement Behaviors during Multi-Person Conversations using Deep Learning
Abstract
Head movements play a pivotal role in multi-talker conversation, providing non-verbal feedback to partners and enhancing a listener's ability to separate sound sources. Commercial hearing aids with an on-board inertial measurement unit (IMU, i.e., accelerometers) typically use that information only for step counting and activity levels. At least one device uses it as an input to its environment classifier and integrated directional microphone. None, however, uses the IMU to detect specific patterns of head movement in order to predict behaviors such as nodding, head shaking, listening to a person versus a video, or talking. Training an automatic classifier to detect these behaviors accurately first requires collecting head-movement data during multi-person conversation and laboriously annotating each behavior type for each participant with high temporal precision. The question then becomes how best to model these data so that the classifier is as accurate as possible and can be integrated with the hearing aid's hardware to improve device performance. To address this gap, we collected accelerometer data during natural multi-person conversations and paired it with detailed human annotations of communication and head-movement behaviors. Head-movement data were collected from three cohorts of young, normal-hearing individuals (three per cohort) in a controlled conference-room setting during 50-minute multi-talker conversations. Participants wore hearing aids with on-board accelerometers, and audio and video were recorded for each talker. Videos were manually annotated for communication and head-movement behaviors, including conversational turns and nonverbal cues such as tilts and nods. Temporal and spectral features corresponding to roll and pitch movements were extracted from the accelerometer data, windowed into 1-second segments. These features, combined with the annotated behaviors, were used to train and test machine learning models. Models were trained on data from all but one participant and tested on the held-out participant, repeating this procedure across all individuals (leave-one-participant-out cross-validation). Several deep learning and classical machine learning models were compared for classifying communication behaviors (e.g., talking, listening, watching video) and head orientations (e.g., turning left or right, facing down, facing forward). In particular, we evaluated several sequence-to-sequence models, a state-of-the-art deep learning approach, incorporating modern architectural components such as transformer networks. Models were evaluated with multiple performance metrics, and the results suggest that modern deep learning models outperform classical machine learning methods by substantial margins; classification performance improved further when temporal sequence information was incorporated. These results indicate that, during multi-talker conversation, hearing-aid accelerometers can be used to automatically classify stereotypical behaviors at high temporal resolution (1 second), and that the models remain reliable even when tested on unseen subjects. This establishes a foundation for more advanced approaches that combine behavioral and movement patterns, further integrate temporal dynamics, and incorporate additional inputs to improve accuracy and ecological validity.
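For concreteness, the Python sketch below illustrates the kind of pipeline the abstract describes: estimating roll and pitch from accelerometer samples, windowing into 1-second segments, extracting simple temporal and spectral features, and evaluating with a leave-one-participant-out procedure. It uses a classical baseline classifier (random forest) rather than the sequence-to-sequence/transformer models reported above, and the sampling rate, feature set, data layout, and integer label coding are assumptions made for illustration, not the authors' implementation.

# Illustrative sketch (not the authors' code): 1-second windowing of hearing-aid
# accelerometer data, roll/pitch feature extraction, and leave-one-participant-out
# evaluation with a classical baseline. Sampling rate and data format are assumed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

FS = 50          # assumed accelerometer sampling rate (Hz)
WIN = FS * 1     # 1-second windows, matching the segmentation described above

def roll_pitch(acc):
    """Estimate roll and pitch (radians) from an (n_samples, 3) accelerometer array."""
    ax, ay, az = acc[:, 0], acc[:, 1], acc[:, 2]
    roll = np.arctan2(ay, az)
    pitch = np.arctan2(-ax, np.sqrt(ay**2 + az**2))
    return np.stack([roll, pitch], axis=1)

def window_features(signals, labels):
    """Cut roll/pitch signals into 1-s windows and compute temporal + spectral features.

    `labels` is assumed to be an integer-coded annotation per sample; each window is
    labeled by the majority annotation it contains.
    """
    feats, targets = [], []
    for start in range(0, len(signals) - WIN + 1, WIN):
        seg = signals[start:start + WIN]          # (WIN, 2): roll and pitch
        spec = np.abs(np.fft.rfft(seg, axis=0))   # magnitude spectrum per channel
        feats.append(np.concatenate([
            seg.mean(axis=0), seg.std(axis=0),    # temporal statistics
            seg.max(axis=0) - seg.min(axis=0),    # range of motion
            spec[1:4].ravel(),                    # low-frequency spectral energy
        ]))
        win_labels = labels[start:start + WIN]
        targets.append(np.bincount(win_labels).argmax())
    return np.array(feats), np.array(targets)

def leave_one_participant_out(recordings):
    """recordings: dict participant_id -> (acc (n, 3), annotations (n,)); hypothetical format."""
    ids = sorted(recordings)
    scores = {}
    for held_out in ids:
        train_X, train_y = [], []
        for pid in ids:
            X, y = window_features(roll_pitch(recordings[pid][0]), recordings[pid][1])
            if pid == held_out:
                test_X, test_y = X, y
            else:
                train_X.append(X)
                train_y.append(y)
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(np.vstack(train_X), np.concatenate(train_y))
        scores[held_out] = balanced_accuracy_score(test_y, clf.predict(test_X))
    return scores

In this sketch the held-out participant is never seen during training, mirroring the subject-independent evaluation described above; the sequence-to-sequence models in the paper would additionally consume ordered windows rather than independent feature vectors.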