A brain-inspired algorithm enhances automatic speech recognition performance in multi-talker scenes
Abstract
Modern automatic speech recognition (ASR) systems achieve impressive performance on clean speech but struggle in noisy, multi-talker environments, a challenge commonly referred to as the “cocktail party problem.” In contrast, many human listeners can solve this problem, suggesting the existence of a solution in the brain. Here we present a novel approach that uses a brain-inspired sound segregation algorithm (BOSSA) as a preprocessing step for a state-of-the-art ASR system (Whisper). We evaluated BOSSA’s impact on ASR accuracy in a spatialized multi-talker scene with one target speaker and two competing maskers, varying the difficulty of the task by changing the target-to-masker ratio. We found that median word error rate improved by up to 54% when the target-to-masker ratio was low. Our results indicate that brain-inspired algorithms have the potential to considerably enhance ASR accuracy in challenging multi-talker scenarios without the need for retraining or fine-tuning existing state-of-the-art ASR systems.
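
For readers unfamiliar with the metric, the following is a minimal sketch of how a median word error rate (WER) could be computed over a set of transcripts. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and simple lowercase, whitespace-based tokenization. The abstract does not describe the scoring pipeline, so the function names `word_error_rate` and `median_wer` and the example sentences are hypothetical illustrations, not the authors' evaluation code.

```python
from statistics import median


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: lowercase, whitespace tokenization; real evaluations
    often apply additional text normalization first.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def median_wer(pairs: list[tuple[str, str]]) -> float:
    """Median WER over (reference, hypothesis) transcript pairs."""
    return median(word_error_rate(ref, hyp) for ref, hyp in pairs)


if __name__ == "__main__":
    # Made-up transcripts, for illustration only.
    example_pairs = [
        ("the cat sat on the mat", "the cat sat on the mat"),
        ("the cat sat on the mat", "the cat sit on mat"),
        ("please call me tomorrow", "please fall tomorrow"),
    ]
    print(median_wer(example_pairs))
```

Under this sketch, comparing conditions would amount to computing `median_wer` once on ASR output from the unprocessed mixtures and once on output from the preprocessed audio at each target-to-masker ratio.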