3D markerless tracking of speech movements with submillimeter accuracy
Abstract
Speech movements are highly complex and require precise spatial and temporal coordination of the oral articulators to support intelligible communication. These same properties make measuring speech movements challenging, often requiring extensive physical sensors placed around the mouth and face that are not easily tolerated by certain populations, such as young children. Recent progress in machine learning-based markerless facial landmark tracking has demonstrated the potential to track lip movements without physical sensors, but whether such technology can provide submillimeter precision and accuracy in 3D remains unknown. It is also unclear whether such technology can be applied to track speech movements in young children. Here, we developed a novel approach that integrates Shape Preserving Facial Landmarks with Graph Attention Networks (SPIGA), a facial landmark detector, with CoTracker, a transformer-based neural network model that jointly tracks dense points across a video sequence. We then examined and validated this approach by assessing its tracking precision and accuracy. The findings revealed that the integrated SPIGA and CoTracker approach was more precise (≈ 0.15 mm standard deviation) than SPIGA alone (≈ 0.35 mm). In addition, its 3D tracking performance was comparable to electromagnetic articulography (≈ 0.29 mm RMSE against simultaneously recorded articulograph data). Importantly, the approach performed similarly well across adults and young children (i.e., 3- and 4-year-olds). Because our framework is built on fully pretrained open-source models, it promotes accessibility and open science while saving computing resources.
Furthermore, given that this framework combines a landmark detection model (SPIGA) with a tracker model (CoTracker) to improve precision and accuracy, our novel approach serves as a proof of concept for enhancing the performance of a wide variety of commonly used markerless tracking applications in biology and neuroscience.
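The core idea of the framework (initializing a point tracker at detected landmarks, then using per-frame detections to bound tracker drift while the tracker suppresses frame-to-frame detection jitter) can be illustrated with a minimal sketch. This is not the authors' actual SPIGA/CoTracker pipeline; the `fuse_tracks` function, the blending weight `alpha`, and the synthetic data below are hypothetical stand-ins used only to show why fusing the two sources can yield lower variance than the detector alone:

```python
import numpy as np

def fuse_tracks(detector_pts, tracker_pts, alpha=0.8):
    """Blend per-frame detector landmarks (unbiased but jittery) with
    tracker trajectories (smooth but prone to drift).

    detector_pts, tracker_pts: arrays of shape (T, N, 2), pixel coordinates
    for T frames and N landmarks.
    alpha: weight on the tracker's frame-to-frame motion; (1 - alpha)
    re-anchors each frame to the detection, bounding drift.
    """
    fused = np.empty_like(tracker_pts, dtype=float)
    fused[0] = detector_pts[0]  # initialize query points at detected landmarks
    for t in range(1, len(tracker_pts)):
        # carry the tracker's estimated motion, then pull toward the detection
        motion = tracker_pts[t] - tracker_pts[t - 1]
        fused[t] = alpha * (fused[t - 1] + motion) + (1 - alpha) * detector_pts[t]
    return fused

# Synthetic demo: a static landmark observed by a jittery detector and a
# slowly drifting tracker (illustrative noise levels, not the paper's data).
rng = np.random.default_rng(0)
T, N = 200, 5
true_pts = np.zeros((T, N, 2))
detector = true_pts + rng.normal(0.0, 0.35, true_pts.shape)            # per-frame jitter
tracker = true_pts + np.cumsum(rng.normal(0.0, 0.01, true_pts.shape), axis=0)  # drift

fused = fuse_tracks(detector, tracker)
print(f"detector std: {detector.std():.3f}, fused std: {fused.std():.3f}")
```

In this toy setting the fused trajectory has markedly lower standard deviation than the raw detections, mirroring the precision gain the abstract reports for the combined approach over the detector alone.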
Author summary
In this work, we examined whether machine learning-based markerless tracking is feasible for tracking 3D lip movements in adults and young children. We developed a novel approach that integrates a landmark detection model (SPIGA) with a tracker model (CoTracker). Our combined CoTracker-based approach demonstrated the submillimeter precision and accuracy desired for speech kinematic recording. In addition, our approach does not require training and validation for each population (e.g., young children vs. adults), saving time and computing resources. We foresee that the proposed general framework of fusing a landmark detection model with a tracker model can be generalized to a wide variety of tracking applications in biology and neuroscience that require high precision and accuracy, including studying cell behaviors, animal motions, and other types of human movements.