Performance and biases of the LENA® and ACLEW algorithms in analyzing language environments in Down, Fragile X, Angelman syndromes, and populations at elevated likelihood for autism

Abstract

Wearable recorders are being used in research and clinical practice to collect and measure children’s vocalizations and language environments. Recordings generate vast amounts of audio, making manual analysis impractical and requiring automated processing. Two automated systems have emerged, the proprietary LENA® and the open-source ACLEW, yet systematic comparisons of their performance remain scarce. Here, we validate and compare the performance of these two algorithms across key measures: audio segmentation into speaker categories, Conversational Turn Count, Adult Word Count, and Child Vocalization Count. This analysis is based on 25 hours of manually annotated audio recordings from 50 age-matched U.S. children with diverse neurodevelopmental profiles: children with Down, Fragile X, and Angelman syndromes, children at elevated likelihood of autism, and low-risk controls. We hypothesized that the algorithms might be less accurate for children with neurodevelopmental conditions, since these children often show different patterns of volubility and vocal maturity compared to the typically developing children used to train the algorithms. We therefore assessed the algorithms’ performance across diagnostic groups, a crucial validation step for both cross-population research and the evaluation of language interventions. Results reveal that while the two algorithms achieve similar overall performance, their error profiles differ: LENA® makes fewer segmentation mistakes but misses many segments (identification error rate = 81.3%, percent correct = 45.3%), while ACLEW shows the opposite pattern (identification error rate = 129.4%, percent correct = 69.4%). Both LENA® and ACLEW achieve reasonable accuracy in their automatic counts (Pearson’s r ranging from .78 to .92) and maintain stable performance across diagnostic groups. We conclude with recommendations for the validation and potential use of these algorithms in research and clinical practice.
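A note on the segmentation metrics: under the standard diarization definitions, identification error rate sums false alarms, missed speech, and speaker confusions, then divides by the total duration of reference speech, so values above 100% are possible; percent correct is the share of reference speech assigned to the correct speaker category. This resolves the apparent paradox above: ACLEW recovers more speech correctly (higher percent correct) while also producing more false alarms (higher identification error rate). Below is a minimal sketch with hypothetical segments and speaker labels, assuming the standard definition as implemented in the pyannote.metrics library (not necessarily the article's exact evaluation code):

from pyannote.core import Annotation, Segment
from pyannote.metrics.identification import IdentificationErrorRate

# Hypothetical 10-second reference annotation: child speech, then adult speech.
reference = Annotation()
reference[Segment(0, 5)] = "CHI"   # child
reference[Segment(5, 10)] = "FEM"  # female adult

# Hypothetical system output: misses part of the child speech (0-2 s),
# confuses the adult's voice type (5-10 s), and hallucinates speech after 10 s.
hypothesis = Annotation()
hypothesis[Segment(2, 5)] = "CHI"
hypothesis[Segment(5, 10)] = "MAL"
hypothesis[Segment(10, 14)] = "FEM"

# IER = (false alarm + missed detection + confusion) / total reference speech,
# so frequent false alarms can push the value above 100%.
ier = IdentificationErrorRate()
print(f"identification error rate = {ier(reference, hypothesis):.1%}")

For these hypothetical segments, missed speech (2 s), confusion (5 s), and false alarms (4 s) against 10 s of reference speech yield an identification error rate of 110%, even though a portion of the speech is still correctly labeled.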
