Arabic SMS Spam Detection Using AraBERT and Dual Feature Extraction: A Study on Modern Standard and Iraqi Dialects

Hussein Alkaabi
Fuqdan Ibraheemi
Ali Jasim
Zainab S. Idan Idan
Ahmed Rahi Alhelal


Abstract

Spam in Short Message Service (SMS) communication poses persistent challenges for traditional detection systems, particularly in morphologically rich and dialectally diverse languages such as Arabic. This study proposes a novel hybrid deep learning framework that integrates AraBERT embeddings with a dual-branch CNN-BiLSTM architecture to detect spam in both Modern Standard Arabic (MSA) and under-resourced dialects such as Iraqi Arabic. A dialect-aware preprocessing pipeline normalizes orthographic variations, preserves contextually meaningful dialectal expressions, and standardizes non-linguistic artifacts common in informal texts. The model is evaluated on two datasets: an Arabic-translated version of the UCI SMS Spam Collection and a newly curated Iraqi Arabic SMS corpus. Experimental results demonstrate state-of-the-art performance: the proposed model achieves 99.12% accuracy on the MSA dataset and 95.3% on the dialectal dataset, outperforming both traditional and deep learning baselines. Comparative evaluations against other pretrained language models (MARBERT, mBERT, CAMeLBERT, and QARiB) indicate that AraBERT generalizes better across formal and dialectal variants. These findings highlight the importance of linguistically informed architectures for robust spam detection and contribute to Arabic Natural Language Processing (NLP) by addressing gaps in dialectal resource availability and model adaptability.
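
The abstract does not spell out the preprocessing rules, so the following is only a minimal sketch of what a dialect-aware normalization step might look like. Every rule here is an assumption drawn from common Arabic NLP practice rather than from the paper: the function name `normalize_sms`, the `<URL>` and `<NUM>` placeholder tokens, and the choice to shorten elongated letters to two repetitions (so that emphasis typical of informal dialectal writing is kept rather than erased) are all hypothetical.

```python
import re

# All rules below are illustrative assumptions, not the authors' pipeline.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")    # tashkeel marks and dagger alef
_TATWEEL = "\u0640"                                   # kashida (stretching character)
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza forms
_URL = re.compile(r"(?:https?://|www\.)\S+")
_NUMBER = re.compile(r"[0-9\u0660-\u0669]+")          # ASCII and Arabic-Indic digits
_ELONGATION = re.compile(r"(.)\1{2,}")                # any character repeated 3+ times


def normalize_sms(text: str) -> str:
    """Normalize one SMS message (hypothetical helper)."""
    # Standardize non-linguistic artifacts first so later rules do not alter them.
    text = _URL.sub(" <URL> ", text)
    text = _NUMBER.sub(" <NUM> ", text)
    # Normalize orthographic variation.
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = _ALEF_VARIANTS.sub("\u0627", text)         # unify to bare alef
    # Shorten elongated runs to two characters instead of removing them entirely.
    text = _ELONGATION.sub(r"\1\1", text)
    # Collapse whitespace introduced by the substitutions.
    return " ".join(text.split())
```

Under these assumptions, a message would be passed through `normalize_sms` before tokenization, so that URLs and amounts are reduced to shared placeholder tokens while the wording of the message is otherwise left intact.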

Version published to 10.21203/rs.3.rs-6832100/v1 on Research Square
Jun 10, 2025

Related articles

Detecting Machine-Generated Arabic Text Using AraBERT and LSTM: Toward Trustworthy NLP in Low-Resource Languages

This article has 3 authors:
1. Tarek Barhoum
2. Mina Ibrahim
3. Mohamad Al Bali
This article has no evaluations
Latest version Aug 8, 2025

Somali Dialect Identification: A Low-Resource Benchmark for MAXAA TIRI and MAAY Using Machine and Deep Learning

This article has 5 authors:
1. Abdifatah Ahmed Gedi
2. Yusuf Mohamed Ahmed
3. Shafie Abdi Mohamed
4. Yusuf Ahmed Yusuf
5. Abdénuur Umur Ebdiyow
This article has no evaluations
Latest version Jul 22, 2025

Towards Secure Social Platforms: Hate Speech Detection and Classification in Indian Languages Using Hybrid Soft Computing Techniques

This article has 1 author:
1. Purbani Kar
This article has no evaluations
Latest version Jul 25, 2025