BERT-Based Model for Identifying Hate Speech and Offensive Language in Arabic Social Media
Abstract
Addressing hate speech has become a major social and political concern due to its harmful impact, and developing detection techniques is essential given the evolving expression patterns on social media platforms. Natural language processing offers effective tools for analyzing such complex texts, but Arabic natural language processing presents significant challenges due to the complexity of the language and the limited availability of high-quality data. This paper investigates the use of the Bidirectional Encoder Representations from Transformers (BERT) model to detect hate speech, with classification performed by machine learning (ML) algorithms. Platform X (Twitter) was chosen as the primary data source because its short text format highlights the challenges of text processing and hate speech detection. The study assesses the performance, robustness, and stability of the BERT model across different dataset sizes, along with the effectiveness of the selected ML algorithms. Principal Component Analysis (PCA) was applied to reduce dimensionality and yielded positive results. Three versions of the BERT model and three dataset sizes were used to achieve the study's objectives. Experimental results showed that classification performance remained relatively stable across different dataset sizes, indicating that BERT models are robust and scalable, with minimal performance degradation even on small datasets. The Support Vector Machine algorithm performed best in most scenarios, reaching 82% accuracy with the MARBERT model on the small dataset, along with 81% for F1, Precision, and Recall. The Random Forest algorithm yielded 81%, 79%, 82%, and 78% for accuracy, F1, Precision, and Recall, respectively, using the ARABERT model on the small dataset. Overall, however, the Support Vector Machine outperformed the other classifiers.
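The pipeline described in the abstract (BERT sentence embeddings, PCA for dimensionality reduction, then an SVM classifier) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the random 768-dimensional vectors below are hypothetical stand-ins for BERT embeddings, which in practice would be produced by an Arabic BERT variant such as MARBERT via a transformer library, and the labels, dimensions, and PCA component count are assumed for demonstration only.

```python
# Sketch of an embeddings -> PCA -> SVM hate-speech classifier.
# The feature vectors here are random stand-ins for BERT [CLS] embeddings
# (hypothetical data); only the classification stage is real.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, embed_dim = 400, 768           # 768 = typical BERT hidden size
X = rng.normal(size=(n_samples, embed_dim))   # stand-in embeddings
y = rng.integers(0, 2, size=n_samples)        # 0 = neutral, 1 = hate/offensive
X[y == 1, :10] += 1.0                         # inject a weak class signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Reduce dimensionality with PCA before fitting the SVM, as in the study.
clf = make_pipeline(PCA(n_components=50), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print(f"test accuracy: {accuracy_score(y_te, clf.predict(X_te)):.2f}")
```

Swapping in real embeddings only changes how `X` is built; the PCA and SVM stages stay the same, which is what makes this pipeline easy to rerun across the three BERT variants and dataset sizes the study compares.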