A Noise-Robust End-to-End Framework for Amharic Speech Recognition
Abstract
End-to-end automatic speech recognition (ASR) offers a streamlined alternative to traditional systems, which rely on separately trained acoustic, pronunciation, and language models. In this paper, we present a noise-robust, end-to-end ASR framework tailored to the Amharic language. Our approach combines a convolutional neural network (CNN), a recurrent neural network (RNN), and Connectionist Temporal Classification (CTC) to transcribe speech directly into text, which removes the need for labor-intensive dictionary creation. We evaluate our method on a corpus of 20,000 noisy Amharic utterances and achieve a word error rate (WER) of 7%. This result indicates that the system handles challenging acoustic conditions effectively. By reducing complexity and manual overhead, our end-to-end model offers a practical and accurate solution for real-world deployments, with broader implications for developing ASR in other low-resource and noise-prone environments.
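To make two terms from the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows how a CTC output is commonly turned into text by greedy decoding (merge repeated labels, then drop the blank symbol) and how WER is conventionally computed as word-level edit distance divided by the number of reference words. The blank symbol `_`, the function names, and the Latin-script toy strings are assumptions made for illustration; the abstract does not specify the label set or the decoding strategy.

```python
from typing import Sequence

BLANK = "_"  # hypothetical blank symbol; the paper's label set is not given


def ctc_greedy_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> str:
    """Collapse per-frame best labels: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy per-frame labels (made up, not from the paper's data).
    frames = list("__ss_e_ll_aa_mm__")
    print(ctc_greedy_collapse(frames))
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 7% corresponds to roughly seven word-level errors per hundred reference words. A blank between two identical labels keeps them distinct, so `l_l` decodes to `ll` while `ll` decodes to `l`.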