MultiAVSR: Robust Speech Recognition via Supervised Multi-Task Audio–Visual Learning
Abstract
Speech recognition approaches typically fall into three categories: audio, visual, and audio–visual. Visual speech recognition, or lip reading, is the most difficult because visual cues are ambiguous and data is scarce. To address these challenges, we present MultiAVSR, a multi-task audio–visual speech recognition framework that trains a single model on all three types of speech recognition simultaneously, primarily to improve visual speech recognition. Unlike prior works, which use separate models or complex semi-supervision, our framework employs a supervised multi-task hybrid Connectionist Temporal Classification (CTC)/Attention loss, cutting training exaFLOPs to just 18% of those required by semi-supervised multi-task models. MultiAVSR achieves a state-of-the-art visual speech recognition word error rate of 21.0% on the LRS3-TED dataset. Furthermore, it generalizes robustly, achieving a 44.7% word error rate on the WildVSR dataset. Our framework also reduces dependency on external language models, which is critical for real-time visual speech recognition. For the audio and audio–visual tasks, it improves robustness under various noisy environments, with average relative word error rate improvements of 16% and 31%, respectively. These gains across all three tasks illustrate the robustness that our supervised multi-task speech recognition framework enables.
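To make the training objective concrete, the sketch below illustrates one plausible form of a supervised multi-task hybrid CTC/Attention loss summed over the audio, visual, and audio–visual tasks. The module names, the CTC weight, the per-task weights, and the `model(inputs, task=...)` interface are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class HybridCTCAttentionLoss(nn.Module):
    """Hedged sketch: interpolate a CTC loss and an attention (cross-entropy) loss.

    `ctc_weight` is an assumed hyperparameter, not the paper's reported value.
    """

    def __init__(self, blank_id: int = 0, ctc_weight: float = 0.3):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.att_loss = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, ctc_log_probs, att_logits, targets,
                input_lengths, target_lengths, att_targets):
        # ctc_log_probs: (T, B, vocab) log-probabilities from the encoder's CTC head
        # att_logits:    (B, L, vocab) logits from the attention decoder
        ctc = self.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths)
        att = self.att_loss(att_logits.transpose(1, 2), att_targets)
        return self.ctc_weight * ctc + (1.0 - self.ctc_weight) * att


def multi_task_loss(batch, model, hybrid_loss,
                    task_weights={"audio": 1.0, "visual": 1.0, "av": 1.0}):
    """Sum the hybrid loss over the three modalities in one supervised step.

    `model` is assumed to accept a `task` argument selecting the input modality
    and to return (ctc_log_probs, att_logits); this interface is hypothetical.
    """
    total = 0.0
    for task in ("audio", "visual", "av"):
        ctc_log_probs, att_logits = model(batch[task]["inputs"], task=task)
        total = total + task_weights[task] * hybrid_loss(
            ctc_log_probs,
            att_logits,
            batch[task]["targets"],
            batch[task]["input_lengths"],
            batch[task]["target_lengths"],
            batch[task]["att_targets"],
        )
    return total
```

Because every branch is trained with ordinary supervised labels, a single backward pass through this summed loss suffices per step, which is one way a fully supervised multi-task setup can avoid the extra pseudo-labeling passes that drive up compute in semi-supervised pipelines.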