Hate Speech Detection in Roman Urdu English Tweets Through Data Pre-processing


Abstract

Hate speech detection enhances internet safety by recognizing and reducing harmful or objectionable content. The growth of social media has made hate speech increasingly difficult to monitor and moderate. Twitter poses a particular challenge because of its informal, varied content and its diverse, multilingual user base, which produces code-mixed text such as Roman Urdu-English. The NLP and machine learning communities have devoted little research to hate speech in code-mixed Roman Urdu-English. To address this gap, this article examines how data pre-processing affects hate speech detection in code-mixed Roman Urdu-English tweets. We apply a comprehensive 10-step data-cleaning procedure followed by the Multilingual BERT (mBERT) model for hate speech detection. The methodology includes hyperparameter optimization and extensive experiments to evaluate the model's accuracy. The results show that the proposed pre-processing approach considerably increases accuracy: compared with previous techniques, the mBERT model achieves roughly a 9.12% gain. This demonstrates the effectiveness of our pre-processing methods and the power of mBERT in improving hate speech detection on social media platforms.
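The abstract does not enumerate the paper's 10 cleaning steps, but a typical tweet-cleaning pipeline for code-mixed Roman Urdu-English text can be sketched as follows. All steps shown here (case folding, URL/mention removal, hashtag stripping, digit and punctuation removal, elongation squeezing, whitespace normalization) are common choices and are assumptions, not the authors' exact procedure:

```python
import re
import string

def preprocess_tweet(text: str) -> str:
    """Illustrative cleaning pipeline for code-mixed Roman Urdu-English
    tweets; the paper's actual 10-step procedure may differ."""
    text = text.lower()                                   # normalize case
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
    text = re.sub(r"@\w+", " ", text)                     # remove @mentions
    text = text.replace("#", " ")                         # keep hashtag word, drop '#'
    text = re.sub(r"\d+", " ", text)                      # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)            # squeeze elongations: "buraaaa" -> "buraa"
    text = re.sub(r"\s+", " ", text).strip()              # collapse whitespace
    return text

print(preprocess_tweet("Yeh banda bohat buraaaa hai!!! @user123 #hate http://t.co/x"))
```

The cleaned string can then be passed to an mBERT tokenizer and classifier for fine-tuning; mBERT's WordPiece vocabulary handles the mixed Roman Urdu-English tokens without language-specific preprocessing.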
