MultiAVSR: Robust Speech Recognition via Supervised Multi-Task Audio–Visual Learning

Abstract

Speech recognition approaches typically fall into three categories: audio, visual, and audio–visual. Visual speech recognition, or lip reading, is the most difficult of the three because visual cues are ambiguous and data is scarce. To address these challenges, we present MultiAVSR, a multi-task audio–visual speech recognition framework that trains a single model on all three tasks simultaneously, primarily to improve visual speech recognition. Unlike prior works that use separate models or complex semi-supervision, our framework employs a supervised multi-task hybrid Connectionist Temporal Classification/Attention loss, cutting training exaFLOPs to just 18% of those required by semi-supervised multi-task models. MultiAVSR achieves a state-of-the-art visual speech recognition word error rate of 21.0% on the LRS3-TED dataset. Furthermore, it generalizes robustly, achieving a 44.7% word error rate on the WildVSR dataset. Our framework also reduces dependency on external language models, which is critical for real-time visual speech recognition. For the audio and audio–visual tasks, it improves robustness under various noisy environments, with average relative word error rate improvements of 16% and 31%, respectively. These gains across all three tasks illustrate the robust results our supervised multi-task speech recognition framework enables.
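
The abstract describes a supervised multi-task hybrid CTC/Attention objective shared across the audio, visual, and audio–visual branches. The sketch below illustrates, in PyTorch, how such an objective could be combined in principle; it is not the authors' implementation, and the function names, tensor shapes, and weights (ctc_weight, task_weights) are illustrative assumptions rather than values taken from the paper.

```python
# A minimal sketch (assumptions, not the paper's code) of a supervised
# multi-task hybrid CTC/Attention loss summed over three recognition tasks.
import torch
import torch.nn.functional as F


def hybrid_ctc_attention_loss(ctc_log_probs, input_lengths,
                              attn_logits, targets, target_lengths,
                              ctc_weight=0.3):
    """Weighted sum of CTC loss and attention (cross-entropy) loss for one task.

    ctc_log_probs: (time, batch, vocab) log-probabilities for the CTC head.
    attn_logits:   (batch, target_len, vocab) logits from the attention decoder.
    targets:       (batch, target_len) token indices; -1 marks padding.
    """
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    att = F.cross_entropy(attn_logits.transpose(1, 2), targets, ignore_index=-1)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att


def multi_task_loss(per_task_outputs, targets, target_lengths,
                    task_weights=None):
    """Sum the hybrid loss over the audio, visual, and audio-visual branches."""
    if task_weights is None:
        task_weights = {"audio": 1.0, "visual": 1.0, "audio_visual": 1.0}
    total = 0.0
    for task, (ctc_log_probs, input_lengths, attn_logits) in per_task_outputs.items():
        total = total + task_weights[task] * hybrid_ctc_attention_loss(
            ctc_log_probs, input_lengths, attn_logits, targets, target_lengths)
    return total
```

In this sketch, all three branches share the same transcript targets, so training one model on audio-only, video-only, and fused audio–visual inputs reduces to summing the per-branch hybrid losses; how the actual framework weights or schedules the tasks is detailed in the full article.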
