The Reliability Fallacy: How Label Ambiguity Undermines AI Hate Speech Detection

Abstract

Automated content moderation is a critical AI security task. However, models often fail at the nuanced, subjective task of distinguishing "hate" from "offensive" speech. The influential HateXplain benchmark attributed this poor performance to a lack of model explainability, proposing rationale-based training as a solution. In this paper, we challenge that premise. We hypothesize that the models' unreliability stems from a more fundamental, unaddressed security flaw: a crisis of data integrity caused by high label ambiguity. The original dataset relies on a majority vote to assign ground-truth labels, which masks significant annotator disagreement and introduces noise. To test our hypothesis, we isolate this variable. We partition the HateXplain dataset into two cohorts: (1) a "noisy" Majority-Label set (using standard 2-1 majority votes) and (2) a "clean" Pure-Label set (using only 3-0 unanimous-consensus votes). We then rigorously benchmark five models (Logistic Regression, Random Forest, LightGBM, GRU, and ALBERT) on both datasets. Our results are conclusive: all models trained on the Pure-Label data achieved statistically significant and substantially higher performance. The ALBERT model's weighted F1-score, for instance, rose from 0.7447 on the "noisy" data to 0.8126 on the "clean" data. This demonstrates that label ambiguity is a more dominant performance bottleneck than the architectural factors previously considered. We conclude that for building secure and reliable AI safety systems, addressing foundational data integrity and label consensus is a more critical challenge than model-level explainability.
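As a rough illustration of the consensus-based split the abstract describes, the minimal Python sketch below partitions posts into a Pure-Label cohort (all three annotators agree, 3-0) and a Majority-Label cohort (simple 2-1 majority). The record layout and the `annotator_labels` field name are hypothetical placeholders, not the paper's or HateXplain's actual data schema.

```python
from collections import Counter

def split_by_consensus(records):
    """Split records into a Pure-Label set (3-0 unanimous votes) and a
    Majority-Label set (2-1 majority votes). Posts with a three-way
    1-1-1 tie have no majority label and are dropped.

    Each record is assumed to look like:
        {"post_id": "...", "annotator_labels": ["hatespeech", "offensive", "hatespeech"]}
    """
    pure, majority = [], []
    for rec in records:
        labels = rec["annotator_labels"]
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count == len(labels):
            # 3-0: unanimous agreement -> "clean" Pure-Label cohort
            pure.append({**rec, "label": top_label})
        elif top_count * 2 > len(labels):
            # 2-1: simple majority -> "noisy" Majority-Label cohort
            majority.append({**rec, "label": top_label})
    return pure, majority

if __name__ == "__main__":
    demo = [
        {"post_id": "a", "annotator_labels": ["hatespeech", "hatespeech", "hatespeech"]},
        {"post_id": "b", "annotator_labels": ["hatespeech", "offensive", "hatespeech"]},
        {"post_id": "c", "annotator_labels": ["hatespeech", "offensive", "normal"]},
    ]
    pure, majority = split_by_consensus(demo)
    print(len(pure), len(majority))  # -> 1 1 (post "c" has no majority and is dropped)
```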
