Benchmarking Self-Supervised Speech Models on Multilingual Nigerian Speech
Abstract
Self-supervised speech models such as Whisper and wav2vec 2.0 have significantly advanced automatic speech recognition (ASR) performance for high-resource languages. However, their robustness and generalization to underrepresented African languages remain insufficiently studied. In this work, we present a systematic benchmark of modern self-supervised ASR models on a multilingual Nigerian speech corpus comprising English, Hausa, Igbo, and Yoruba. Using the Nigerian Common Voice dataset (158 hours), we evaluate the zero-shot performance of pretrained models and compare it with supervised adaptation via fine-tuning of multilingual speech encoders. We report Word Error Rate (WER) and Character Error Rate (CER) for each language and analyze the effects of supervised adaptation and cross-language transfer. Our results show that zero-shot ASR performance is substantially degraded for Nigerian languages compared to widely represented benchmark languages. Supervised fine-tuning consistently improves recognition accuracy, although the magnitude of improvement varies across languages and depends on the compatibility between the pretrained checkpoint and the target language. In particular, adaptation from a Hausa-pretrained XLS-R model yields strong gains for Hausa but more limited improvements for Igbo, highlighting the importance of language-specific training data. These findings demonstrate that multilingual pretraining alone is insufficient for reliable ASR in underrepresented African languages and that supervised adaptation remains necessary for robust deployment. The study provides reproducible benchmarks for multilingual ASR evaluation in African contexts and offers practical guidance for adapting large-scale speech models to underrepresented languages.
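The WER and CER reported above are both normalized Levenshtein distances, computed over words and characters respectively. A minimal self-contained sketch of these metrics follows (function names are illustrative, not taken from the paper; in practice a library such as jiwer is commonly used):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling 1-D table."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # row for the empty reference prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal cell D[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution (cost 0 if equal)
            )
            prev = cur
    return dp[n]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance, spaces excluded."""
    ref_chars = list(reference.replace(" ", ""))
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)
```

For example, `wer("a b c", "a x c")` is 1/3 (one substitution out of three reference words). Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is common in the degraded zero-shot setting the abstract describes.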