Classification of Communication and Head Movement Behaviors during Multi-Person Conversations using Deep Learning
Abstract
Head movements play a pivotal role in multi-talker conversation, providing non-verbal feedback to partners and enhancing a listener's ability to separate sound sources. Commercial hearing aids with an on-board inertial measurement unit (IMU, i.e., accelerometers) typically use that information only for step counting and activity levels. At least one device uses it as an input to its environment classifier and integrated directional microphone. None, however, uses the IMU to detect specific patterns of head movement in order to predict behaviors such as nodding, head shaking, listening to a person versus a video, or talking. Training an automatic classifier to detect these behaviors accurately first requires collecting head-movement data during multi-person conversation and laboriously annotating each behavior type for each participant with high temporal precision. The question then becomes how best to model these data so that the classifier is as accurate as possible and can be integrated with the hearing aid's hardware to improve device performance. To address this gap, we collected accelerometer data during natural multi-person conversations and paired it with detailed human annotations of communication and head-movement behaviors. Head-movement data were collected from three cohorts of young, normal-hearing individuals (three per cohort) in a controlled conference-room setting during 50-minute multi-talker conversations. Participants wore hearing aids with on-board accelerometers, and audio and video were recorded for each talker. Videos were manually annotated for communication and head-movement behaviors, including conversational turns and nonverbal cues such as tilts and nods. Temporal and spectral features corresponding to roll and pitch movements were extracted from the accelerometer data, windowed into 1-second segments. These features, combined with the annotated behaviors, were used to train and test machine learning models. Models were trained on data from all but one participant and tested on the held-out participant, repeating this procedure across all individuals (leave-one-participant-out cross-validation). Several deep learning and classical machine learning models were compared for classifying communication behaviors (e.g., talking, listening, watching video) and head orientations (e.g., turning left or right, facing down, facing forward). In particular, we evaluated several sequence-to-sequence models, a state-of-the-art deep learning approach, incorporating modern architectural components such as transformer networks. Models were evaluated with multiple performance metrics, and the results suggest that modern deep learning models outperform classical machine learning methods by substantial margins; classification performance improved further when temporal sequence information was incorporated. These results indicate that, during multi-talker conversation, hearing-aid accelerometers can be used to automatically classify stereotypical behaviors at high temporal resolution (1 second), and that the models remain reliable even when tested on unseen subjects. This establishes a foundation for more advanced approaches that combine behavioral and movement patterns, further integrate temporal dynamics, and incorporate additional inputs to improve accuracy and ecological validity.
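For concreteness, the Python sketch below illustrates the kind of pipeline the abstract describes: estimating roll and pitch from accelerometer samples, windowing into 1-second segments, extracting simple temporal and spectral features, and evaluating with a leave-one-participant-out procedure. It uses a classical baseline classifier (random forest) rather than the sequence-to-sequence/transformer models reported above, and the sampling rate, feature set, data layout, and integer label coding are assumptions made for illustration, not the authors' implementation.

# Illustrative sketch (not the authors' code): 1-second windowing of hearing-aid
# accelerometer data, roll/pitch feature extraction, and leave-one-participant-out
# evaluation with a classical baseline. Sampling rate and data format are assumed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

FS = 50          # assumed accelerometer sampling rate (Hz)
WIN = FS * 1     # 1-second windows, matching the segmentation described above

def roll_pitch(acc):
    """Estimate roll and pitch (radians) from an (n_samples, 3) accelerometer array."""
    ax, ay, az = acc[:, 0], acc[:, 1], acc[:, 2]
    roll = np.arctan2(ay, az)
    pitch = np.arctan2(-ax, np.sqrt(ay**2 + az**2))
    return np.stack([roll, pitch], axis=1)

def window_features(signals, labels):
    """Cut roll/pitch signals into 1-s windows and compute temporal + spectral features.

    `labels` is assumed to be an integer-coded annotation per sample; each window is
    labeled by the majority annotation it contains.
    """
    feats, targets = [], []
    for start in range(0, len(signals) - WIN + 1, WIN):
        seg = signals[start:start + WIN]          # (WIN, 2): roll and pitch
        spec = np.abs(np.fft.rfft(seg, axis=0))   # magnitude spectrum per channel
        feats.append(np.concatenate([
            seg.mean(axis=0), seg.std(axis=0),    # temporal statistics
            seg.max(axis=0) - seg.min(axis=0),    # range of motion
            spec[1:4].ravel(),                    # low-frequency spectral energy
        ]))
        win_labels = labels[start:start + WIN]
        targets.append(np.bincount(win_labels).argmax())
    return np.array(feats), np.array(targets)

def leave_one_participant_out(recordings):
    """recordings: dict participant_id -> (acc (n, 3), annotations (n,)); hypothetical format."""
    ids = sorted(recordings)
    scores = {}
    for held_out in ids:
        train_X, train_y = [], []
        for pid in ids:
            X, y = window_features(roll_pitch(recordings[pid][0]), recordings[pid][1])
            if pid == held_out:
                test_X, test_y = X, y
            else:
                train_X.append(X)
                train_y.append(y)
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(np.vstack(train_X), np.concatenate(train_y))
        scores[held_out] = balanced_accuracy_score(test_y, clf.predict(test_X))
    return scores

In this sketch the held-out participant is never seen during training, mirroring the subject-independent evaluation described above; the sequence-to-sequence models in the paper would additionally consume ordered windows rather than independent feature vectors.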