Arabic SMS Spam Detection Using AraBERT and Dual Feature Extraction: A Study on Modern Standard and Iraqi Dialects
Abstract
The proliferation of spam messages in Short Message Service (SMS) communication, particularly in morphologically rich and dialectally diverse languages like Arabic, poses persistent challenges for traditional spam detection systems. This study proposes a novel hybrid deep learning framework that integrates AraBERT embeddings with a dual-branch CNN-BiLSTM architecture to detect spam effectively in both Modern Standard Arabic (MSA) and under-resourced dialects such as Iraqi Arabic. A dialect-aware preprocessing pipeline is developed to normalize orthographic variations, preserve contextually meaningful dialectal expressions, and standardize non-linguistic artifacts common in informal texts. The model is evaluated on two datasets: an Arabic-translated version of the UCI SMS Spam Collection and a newly curated Iraqi Arabic SMS corpus. Experimental results demonstrate state-of-the-art performance: the proposed model achieves 99.12% accuracy on the MSA dataset and 95.3% on the dialectal dataset, outperforming both traditional and deep learning baselines. Comparative evaluations with other Arabic language models, including MARBERT, mBERT, CAMeLBERT, and QARiB, indicate that AraBERT generalizes better across formal and dialectal variants. These findings highlight the importance of linguistically informed architectures for robust spam detection and contribute to Arabic Natural Language Processing (NLP) by addressing gaps in dialectal resource availability and model adaptability.
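To make the preprocessing idea concrete, the following is a minimal Python sketch of the kind of normalization step the abstract describes. It is not the authors' pipeline: the function name `normalize_sms`, the placeholder tokens `<URL>` and `<NUM>`, and every rule below are assumptions drawn from common Arabic text-normalization practice. The sketch removes diacritics and tatweel, unifies hamza-bearing alef forms, collapses character elongation, and replaces URLs and digit runs with placeholders, while leaving letters used in Iraqi dialect writing (such as gaf, U+06AF) untouched.

```python
import re

# All rules here are illustrative assumptions, not taken from the paper.
_URL = re.compile(r"(?:https?://|www\.)\S+")
_NUMBER = re.compile(r"[0-9\u0660-\u0669]+")          # Western and Arabic-Indic digits
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")    # tashkeel and dagger alef
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda/hamza forms
_ELONGATION = re.compile(r"(.)\1{2,}")                # same character 3+ times in a row
_WHITESPACE = re.compile(r"\s+")
_TATWEEL = "\u0640"
_BARE_ALEF = "\u0627"


def normalize_sms(text: str) -> str:
    """Normalize one SMS message (hypothetical helper for illustration)."""
    # Standardize non-linguistic artifacts first, so URL digits are not split out.
    text = _URL.sub(" <URL> ", text)
    text = _NUMBER.sub(" <NUM> ", text)
    # Orthographic normalization: drop diacritics and tatweel, unify alef forms.
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = _ALEF_VARIANTS.sub(_BARE_ALEF, text)
    # Collapse expressive elongation (three or more repeats) to a single character.
    text = _ELONGATION.sub(r"\1", text)
    # Dialect-specific letters (for example U+06AF, U+0686) are deliberately kept.
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_sms("call 0770 now  http://x.io/a1"))
```

Collapsing elongation to a single character is a lossy choice; a variant that keeps two repeats would be more conservative, and which one suits a given corpus is an empirical question this sketch does not settle.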