From Substitution to Complementarity: Leveraging BERT-VITS2 and Real Speech for Better Chinese Dysarthric Speech Recognition




Abstract

Dysarthric Speech Recognition (DSR) is essential for improving communication for individuals with dysarthria, yet the scarcity of Chinese dysarthric speech data significantly limits system performance. This paper presents the first application of fine-tuned BERT-VITS2 for data augmentation in Chinese DSR and proposes a complementary training strategy that combines synthetic and authentic dysarthric speech. The framework incorporates three targeted modifications: (1) duration-aware attention and an extended stochastic duration predictor to model irregular temporal patterns, (2) articulatory-constrained phoneme embeddings to model distorted spectral characteristics, and (3) re-weighted loss functions that balance reconstruction fidelity and alignment accuracy. Experiments on the Chinese Dysarthric Speech Database (CDSD) and the Mandarin Dysarthria Speech Corpus (MDSC) demonstrate an approximately 30% relative reduction in character error rate (CER) compared to models trained solely on real data. Crucially, comprehensive acoustic analysis reveals that while synthetic speech effectively complements authentic data, it cannot fully substitute for real dysarthric speech because it does not capture the full variability in pitch, loudness, and temporal patterns. These findings establish TTS-based augmentation as a complementary resource rather than a replacement for authentic data.
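
For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate and a relative CER reduction are conventionally computed. It is not the authors' evaluation code: the function names and the example values (a baseline CER of 0.20 and an augmented CER of 0.14) are assumptions chosen only to illustrate what a 30% relative reduction means, and do not come from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_cer_reduction(baseline: float, augmented: float) -> float:
    """Fraction of the baseline CER removed by the augmented system."""
    return (baseline - augmented) / baseline


# Illustrative values only (hypothetical, not results from the paper).
example_cer = cer("今天天气很好", "今天天汽很好")       # one substitution in six characters
example_reduction = relative_cer_reduction(0.20, 0.14)  # roughly 0.30
```

Under these assumed numbers, an absolute drop of 6 CER points from a 20% baseline corresponds to a 30% relative reduction, which is why relative and absolute improvements should not be read interchangeably.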
