BERT-Based Model for Identifying Hate Speech and Offensive Language in Arabic Social Media
Abstract
Addressing hate speech has become a major social and political concern due to its harmful impact, and developing detection techniques is essential given the evolving expression patterns on social media platforms. Natural language processing offers effective tools for analyzing such complex texts, but Arabic natural language processing presents significant challenges due to the complexity of the language and the limited availability of high-quality data. This paper investigates the use of the Bidirectional Encoder Representations from Transformers (BERT) model to detect hate speech, with classification performed by machine learning (ML) algorithms. Platform X (Twitter) was chosen as the primary data source because its short text format highlights the challenges of text processing and hate speech detection. The study assesses the performance, robustness, and stability of the BERT model across different dataset sizes, along with the effectiveness of the selected ML algorithms. Principal Component Analysis (PCA) was applied to reduce dimensionality and yielded positive results. Three versions of the BERT model and three dataset sizes were used to achieve the study's objectives. Experimental results showed that classification performance remained relatively stable across different dataset sizes, indicating that BERT models are robust and scalable, with minimal performance degradation even on small datasets. The Support Vector Machine algorithm performed best in most scenarios, reaching 82% accuracy with the MARBERT model on the small dataset, along with 81% for F1, Precision, and Recall. The Random Forest algorithm yielded 81%, 79%, 82%, and 78% for accuracy, F1, Precision, and Recall, respectively, using the ARABERT model on the small dataset. Overall, however, the Support Vector Machine outperformed the other classifiers.
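The pipeline described in the abstract (BERT sentence embeddings, PCA for dimensionality reduction, then an SVM classifier) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the random 768-dimensional vectors below are hypothetical stand-ins for BERT embeddings, which in practice would be produced by an Arabic BERT variant such as MARBERT via a transformer library, and the labels, dimensions, and PCA component count are assumed for demonstration only.

```python
# Sketch of an embeddings -> PCA -> SVM hate-speech classifier.
# The feature vectors here are random stand-ins for BERT [CLS] embeddings
# (hypothetical data); only the classification stage is real.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, embed_dim = 400, 768           # 768 = typical BERT hidden size
X = rng.normal(size=(n_samples, embed_dim))   # stand-in embeddings
y = rng.integers(0, 2, size=n_samples)        # 0 = neutral, 1 = hate/offensive
X[y == 1, :10] += 1.0                         # inject a weak class signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Reduce dimensionality with PCA before fitting the SVM, as in the study.
clf = make_pipeline(PCA(n_components=50), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print(f"test accuracy: {accuracy_score(y_te, clf.predict(X_te)):.2f}")
```

Swapping in real embeddings only changes how `X` is built; the PCA and SVM stages stay the same, which is what makes this pipeline easy to rerun across the three BERT variants and dataset sizes the study compares.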