A Deep Learning Framework for Audio Data Augmentation to Promote Linguistic Diversity
Abstract
In language technology, sustaining linguistic diversity is a critical challenge due to the scarcity of speech data for underrepresented languages and dialects. This paper addresses that scarcity by proposing a generative audio model designed to synthesize realistic speech samples for these languages. Our approach converts audio waveforms into spectrograms and treats them as 2D images, which a Convolutional Neural Network (CNN) then classifies. We use a subset of the Speech Commands dataset to demonstrate the methodology, which involves preprocessing audio into fixed-length samples, converting them to spectrograms with the Short-Time Fourier Transform (STFT), and training a CNN to recognize voice commands. The trained model achieves a test accuracy of approximately 88.7%, indicating its efficacy in classifying distinct audio commands. This project lays the groundwork for a synthetic data pipeline that can augment limited datasets, thereby advancing speech recognition for endangered and less-resourced languages and promoting a more inclusive and sustainable linguistic landscape.
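To make the described pipeline concrete, the sketch below shows one plausible way to turn a 1-second Speech Commands waveform into an STFT magnitude spectrogram and feed it to a small image-style CNN in TensorFlow. It is an illustrative sketch, not the authors' implementation: the frame parameters, the Resizing step, the layer sizes, and the class count NUM_CLASSES are assumptions chosen for a lightweight example.

```python
import tensorflow as tf

SAMPLE_RATE = 16000   # Speech Commands clips are 1 s at 16 kHz
NUM_CLASSES = 8       # assumed size of the command subset; adjust as needed

def to_spectrogram(waveform):
    """Pad/trim a 1-D waveform to 1 s and convert it to an STFT magnitude spectrogram."""
    waveform = waveform[:SAMPLE_RATE]
    waveform = tf.pad(waveform, [[0, SAMPLE_RATE - tf.shape(waveform)[0]]])
    stft = tf.signal.stft(waveform, frame_length=255, frame_step=128)
    spectrogram = tf.abs(stft)            # keep only the magnitude
    return spectrogram[..., tf.newaxis]   # add a channel axis -> (time, freq, 1)

# A small CNN that treats the spectrogram as a single-channel image.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(124, 129, 1)),   # shape produced by the STFT settings above
    tf.keras.layers.Resizing(32, 32),             # downsample for a lightweight model
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES),           # logits, one per voice command
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.summary()
```

In this configuration a 16,000-sample waveform yields a 124 x 129 spectrogram (124 frames, 129 frequency bins), which is why that input shape appears above; training would then proceed with model.fit on batches of (spectrogram, label) pairs built from the dataset.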