The Reliability Fallacy: How Label Ambiguity Undermines AI Hate Speech Detection
Abstract
Automated content moderation is a critical AI security task. However, models often fail in the nuanced, subjective task of distinguishing “hate” from “offensive” speech. The influential HateXplain benchmark attributed this poor performance to a lack of model explainability, proposing rationale-based training as a solution. In this paper, we challenge this premise. We hypothesize that the models’ unreliability stems from a more fundamental, unaddressed security flaw: a crisis of data integrity caused by high label ambiguity. The original dataset relies on a “majority vote” to assign ground-truth labels, which masks significant annotator disagreement and introduces noise. To test our hypothesis, we isolate this variable. We partition the HateXplain dataset into two cohorts: (1) a “noisy” Majority-Label set (using standard 2-1 majority votes) and (2) a “clean” Pure-Label set (using only 3-0 unanimous-consensus votes). We then rigorously benchmark five models (Logistic Regression, Random Forest, LightGBM, GRU, and ALBERT) on both datasets. Our results are conclusive. All models trained on the Pure-Label data achieved statistically significant and substantially higher performance. The ALBERT model’s weighted F1-score, for instance, rose from 0.7447 on the “noisy” data to 0.8126 on the “clean” data. This demonstrates that label ambiguity is a more dominant performance bottleneck than the architectural factors previously considered. We conclude that for building secure and reliable AI safety systems, addressing foundational data integrity and label consensus is a more critical challenge than model-level explainability.
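Below is a minimal sketch of one way to perform the cohort split described in the abstract. It assumes the publicly released HateXplain JSON layout (a dictionary keyed by post ID, where each record carries an "annotators" list whose items hold a "label" field); those key names, and the choice to keep the 2-1 and 3-0 cohorts disjoint, are assumptions rather than the paper's confirmed procedure, so adjust them to your copy of the data.

```python
from collections import Counter
import json


def split_by_consensus(path):
    """Partition HateXplain-style annotations into a 'Pure-Label' cohort
    (all annotators agree, e.g. 3-0) and a 'Majority-Label' cohort
    (a simple majority, e.g. 2-1).

    Assumed record shape (not confirmed by the abstract):
        {post_id: {"annotators": [{"label": ...}, ...], ...}, ...}
    """
    with open(path) as f:
        data = json.load(f)

    pure, majority = [], []
    for post_id, record in data.items():
        votes = Counter(a["label"] for a in record["annotators"])
        label, count = votes.most_common(1)[0]
        n_annotators = len(record["annotators"])

        if count == n_annotators:          # unanimous vote (e.g. 3-0)
            pure.append((post_id, label))
        elif count > n_annotators / 2:     # simple majority (e.g. 2-1)
            majority.append((post_id, label))
        # Posts with no majority label (e.g. a 1-1-1 split) are dropped.

    return pure, majority
```

Each cohort can then be fed to the same training pipeline so that annotator agreement is the only variable changing between the two benchmark runs.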