A Deep Learning Framework for Audio Data Augmentation to Promote Linguistic Diversity

Abstract

In language technology, sustaining linguistic diversity is a critical challenge because underrepresented languages and dialects lack sufficient speech data. This paper addresses that scarcity by proposing a generative audio model designed to synthesize realistic speech samples for these languages. Our approach converts audio waveforms into spectrograms and treats them as 2D images that a Convolutional Neural Network (CNN) classifies. We use a subset of the Speech Commands dataset to demonstrate the methodology, which involves preprocessing audio into fixed-length samples, converting them to spectrograms using the Short-Time Fourier Transform (STFT), and training a CNN to recognize voice commands. The trained model achieves a test accuracy of approximately 88.7%, indicating its efficacy in classifying distinct audio commands. This project lays the groundwork for a synthetic data pipeline that can augment limited datasets, thereby advancing speech recognition for endangered and less-resourced languages and promoting a more inclusive and sustainable linguistic landscape.
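The abstract describes a concrete pipeline: pad or trim each recording to a fixed length, compute an STFT magnitude spectrogram, and train a small CNN on the resulting 2D representation. The sketch below illustrates one way that pipeline could look, assuming TensorFlow/Keras; the frame sizes, layer configuration, and number of classes are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch (not the authors' code) of the pipeline described in the
# abstract: fixed-length audio -> STFT magnitude spectrogram -> small CNN.
# Frame sizes, layer widths, and num_classes are illustrative assumptions.
import tensorflow as tf

def waveform_to_spectrogram(waveform, length=16000):
    """Pad/trim a 1-D waveform to a fixed length and return an STFT
    magnitude spectrogram shaped like a single-channel image."""
    waveform = waveform[:length]
    waveform = tf.pad(waveform, [[0, length - tf.shape(waveform)[0]]])
    stft = tf.signal.stft(waveform, frame_length=255, frame_step=128)
    spectrogram = tf.abs(stft)            # keep magnitude only
    return spectrogram[..., tf.newaxis]   # shape: (time, freq, 1)

def build_cnn(input_shape, num_classes):
    """Small CNN that treats the spectrogram as a 2-D image."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Resizing(32, 32),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes),
    ])

# Example usage on a dummy waveform (stand-in for a Speech Commands clip).
spec = waveform_to_spectrogram(tf.random.uniform([16000], -1.0, 1.0))
model = build_cnn(spec.shape, num_classes=8)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.summary()
```

In a full experiment, the dummy waveform would be replaced by Speech Commands clips loaded and label-encoded through a tf.data pipeline, and the model trained with model.fit before measuring test accuracy.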
