Somali Dialect Identification: A Low-Resource Benchmark for MAXAA TIRI and MAAY Using Machine and Deep Learning
Abstract
This study addresses automatic dialect identification within the Somali language, focusing on its two primary dialects: MAXAA TIRI and MAAY. Somali, spoken by over 22 million people, shows significant dialectal diversity, which poses challenges for downstream NLP applications such as sentiment analysis, machine translation, and information retrieval. To address these challenges, we construct and annotate a representative dataset of 3,011 text samples collected from diverse sources, including social media and formal documents. We evaluate a range of machine learning and deep learning models, namely Naive Bayes, Support Vector Machines (SVM), and Bidirectional Long Short-Term Memory (BiLSTM), on classifying text by dialectal features. Our results demonstrate high performance, with the Naive Bayes and BiLSTM models achieving classification accuracies of 99.86% and 99.57%, respectively. To assess generalizability, we apply rigorous validation methods, including cross-source testing and real-world deployment through a web-based interface. This research contributes a novel dataset, benchmarks several models for Somali dialect identification, and provides foundational insights for advancing low-resource language processing.
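As a rough illustration of the kind of pipeline the abstract describes, the sketch below pairs a Naive Bayes classifier with a cross-source test, where the model is trained on every source except one and scored on the held-out source. It is not the authors' implementation. The character trigram features, the add-one smoothing, the `source` field, and the helper names `char_ngrams`, `NaiveBayesDialectClassifier`, and `cross_source_accuracy` are all assumptions made for this example, and the sample strings are synthetic placeholders rather than real Somali text.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesDialectClassifier:
    """Multinomial Naive Bayes over character n-grams (assumed feature choice)."""

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.ngram_counts[label]
            # Add-one smoothing keeps unseen n-grams from zeroing the likelihood.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def cross_source_accuracy(samples, held_out_source):
    """Train on every source except one, then test on the held-out source."""
    train = [s for s in samples if s["source"] != held_out_source]
    test = [s for s in samples if s["source"] == held_out_source]
    model = NaiveBayesDialectClassifier().fit(
        [s["text"] for s in train], [s["label"] for s in train]
    )
    correct = sum(model.predict(s["text"]) == s["label"] for s in test)
    return correct / len(test)


# Synthetic placeholder strings, NOT real Somali text; only the labels
# come from the abstract.
samples = [
    {"text": "aaba abab baab", "label": "MAXAA TIRI", "source": "social"},
    {"text": "abba aaab baba", "label": "MAXAA TIRI", "source": "formal"},
    {"text": "xyyx yxyx xxyy", "label": "MAAY", "source": "social"},
    {"text": "yxxy xyxy yyxx", "label": "MAAY", "source": "formal"},
]
accuracy = cross_source_accuracy(samples, held_out_source="formal")
```

Holding out an entire source, rather than a random subset of samples, checks whether the classifier has learned dialectal cues instead of source-specific style such as social media spelling habits.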