Benchmarking Self-Supervised Speech Models on Multilingual Nigerian Speech
Abstract
Self-supervised speech models such as Whisper and wav2vec 2.0 have significantly advanced automatic speech recognition (ASR) performance for high-resource languages. However, their robustness and generalization to underrepresented African languages remain insufficiently studied. In this work, we present a systematic benchmark of modern self-supervised ASR models on a multilingual Nigerian speech corpus comprising English, Hausa, Igbo, and Yoruba. Using the Nigerian Common Voice dataset (158 hours), we evaluate the zero-shot performance of pretrained models and compare it with supervised adaptation via fine-tuning of multilingual speech encoders. We report Word Error Rate (WER) and Character Error Rate (CER) for each language and analyze the effects of supervised adaptation and cross-language transfer. Our results show that zero-shot ASR performance is substantially degraded for Nigerian languages compared to widely represented benchmark languages. Supervised fine-tuning consistently improves recognition accuracy, although the magnitude of improvement varies across languages and depends on the compatibility between the pretrained checkpoint and the target language. In particular, adaptation from a Hausa-pretrained XLS-R model yields strong gains for Hausa but more limited improvements for Igbo, highlighting the importance of language-specific training data. These findings demonstrate that multilingual pretraining alone is insufficient for reliable ASR in underrepresented African languages and that supervised adaptation remains necessary for robust deployment. The study provides reproducible benchmarks for multilingual ASR evaluation in African contexts and offers practical guidance for adapting large-scale speech models to underrepresented languages.
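The WER and CER reported above are both normalized Levenshtein distances, computed over words and characters respectively. A minimal self-contained sketch of these metrics follows (function names are illustrative, not taken from the paper; in practice a library such as jiwer is commonly used):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling 1-D table."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # row for the empty reference prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal cell D[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution (cost 0 if equal)
            )
            prev = cur
    return dp[n]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance, spaces excluded."""
    ref_chars = list(reference.replace(" ", ""))
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)
```

For example, `wer("a b c", "a x c")` is 1/3 (one substitution out of three reference words). Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is common in the degraded zero-shot setting the abstract describes.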