Somali Dialect Identification: A Low-Resource Benchmark for MAXAA TIRI and MAAY Using Machine and Deep Learning

Abdifatah Ahmed Gedi
Yusuf Mohamed Ahmed
Shafie Abdi Mohamed
Yusuf Ahmed Yusuf
Abdénuur Umur Ebdiyow

Abstract

This study addresses the task of automatic dialect identification within the Somali language, focusing on its two primary dialects: MAXAA TIRI and MAAY. Somali, spoken by over 22 million individuals, presents significant dialectal diversity, which poses challenges for downstream NLP applications such as sentiment analysis, machine translation, and information retrieval. To address this challenge, the study constructs and annotates a representative dataset of 3,011 text samples collected from diverse sources, including social media and formal documents. The study evaluates a range of machine learning and deep learning models, namely Naive Bayes, Support Vector Machines (SVM), and Bidirectional Long Short-Term Memory (BiLSTM), to classify text based on dialectal features. The results demonstrate high performance, with the Naive Bayes and BiLSTM models achieving classification accuracies of 99.86% and 99.57%, respectively. To ensure generalizability, the study applies rigorous validation methods, including cross-source testing and real-world deployment through a web-based interface. This research contributes a novel dataset, benchmarks several AI models for Somali dialect identification, and provides foundational insights for advancing low-resource language processing.
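As a rough illustration of the Naive Bayes approach named in the abstract, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on character trigrams, using only the Python standard library. The abstract does not state which features or implementation the authors used, so the trigram features, the class name NaiveBayesDialectClassifier, the helper char_ngrams, and the training strings are all assumptions made for illustration. The training strings are synthetic placeholders, not real Somali text.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesDialectClassifier:
    """Multinomial Naive Bayes over character n-grams (hypothetical sketch)."""

    def __init__(self, n=3):
        self.n = n
        self.label_counts = Counter()             # documents seen per label
        self.ngram_counts = defaultdict(Counter)  # n-gram counts per label
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.label_counts[label] += 1
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus summed log likelihoods with add-one smoothing.
            score = math.log(doc_count / total_docs)
            total = sum(self.ngram_counts[label].values())
            for gram in char_ngrams(text, self.n):
                count = self.ngram_counts[label][gram]
                score += math.log((count + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic placeholder strings (NOT real Somali), two per label.
train_texts = ["aaa bab aab", "bab aaa aba", "zzu uzu zuz", "uzz zuz uzu"]
train_labels = ["MAXAA TIRI", "MAXAA TIRI", "MAAY", "MAAY"]

clf = NaiveBayesDialectClassifier(n=3).fit(train_texts, train_labels)
```

Calling clf.predict on a new string returns the label with the highest posterior score. Character n-grams are a common choice for dialect identification because they capture spelling-level differences without requiring a tokenizer, but whether they match the features used in this study is not known from the abstract.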

Version published to 10.21203/rs.3.rs-7163778/v1 on Research Square
Jul 22, 2025

Related articles

Ekantipur-15Y: A Longitudinal Benchmark Corpus and Semantic Analysis of Nepali News (2010 - 2025)

This article has 2 authors:
1. Diwash Mainali
2. Utsav Mainali
This article has no evaluations
Latest version Mar 3, 2026

Reg2Bangla: An End-to-End Regional Speech Standardization

This article has 7 authors:
1. Samiul Basir Bhuiyan
2. Md Sazzad Hossain Adib
3. Mohammed Aman Bhuiyan
4. Aritra Islam Saswato
5. Ahmed Faizul Haque Dhrubo
6. Mohammad Ashrafuzzaman Khan
7. Mohammad Abdul Qayum
This article has no evaluations
Latest version Mar 17, 2026

Towards High-Quality Machine Translation for Kokborok: A Low-Resource Tibeto-Burman Language of Northeast India

This article has 2 authors:
1. Badal Nyalang
2. Biman Debbarma
This article has no evaluations
Latest version Mar 31, 2026
