Can large language models effectively reshape online implicit hate speech? An integrative modelling approach

Abstract

Implicit Hate Speech (IHS) poses major challenges for traditional content governance, and using large language models (LLMs) to reshape such content has emerged as a promising alternative. This study evaluates the capability, strategies, and risks of LLMs in reshaping IHS through an integrative modelling approach. First, we build a predictive model to measure the external effect of LLM reshaping. Second, we design an explainable evaluation framework with four dimensions: group-specific harm, implicit emotional expression, linguistic obfuscation and extremity, and bias and implication in social interaction. Results show that LLMs (e.g., GPT-4o and DeepSeek) can substantially reshape IHS texts in categories such as threatening and inferiority, reducing toxicity by 86.2%–90.57% while preserving high semantic similarity (BERTScore F1: 82%–85%). However, reshaping is not full detoxification: it often replaces explicit risk with new covert forms. Explicit attacks are reduced, but covert risks may emerge through strategies such as vague references, concealed emotions, or introduced logical gaps. The study confirms the value of LLMs in IHS governance while also revealing their “replace-rather-than-remove” pattern. The proposed framework offers a useful tool for detecting and managing algorithm-induced covert risks, providing theoretical and practical guidance for building a more civil online space.
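The headline numbers pair a toxicity metric with BERTScore F1 computed over original/reshaped text pairs. The sketch below is a minimal illustration of how such a comparison could be scored, assuming the open-source Detoxify classifier and the bert_score package as stand-ins; the abstract does not state which toxicity scorer the authors use, and the placeholder texts are hypothetical.

```python
# Illustrative sketch (not the authors' pipeline): score toxicity reduction and
# semantic preservation for original vs. LLM-reshaped text pairs, assuming the
# `detoxify` and `bert_score` packages as stand-ins for the paper's metrics.
from detoxify import Detoxify
from bert_score import score

originals = ["<implicit hate speech example>"]   # hypothetical source IHS texts
reshaped = ["<LLM-reshaped counterpart>"]        # hypothetical LLM outputs

tox_model = Detoxify("original")                 # pretrained toxicity classifier
tox_before = [tox_model.predict(t)["toxicity"] for t in originals]
tox_after = [tox_model.predict(t)["toxicity"] for t in reshaped]

# Percentage toxicity reduction per text pair
reductions = [100.0 * (b - a) / b for b, a in zip(tox_before, tox_after) if b > 0]

# Semantic similarity between reshaped and original text (BERTScore F1)
_, _, f1 = score(reshaped, originals, lang="en", verbose=False)

print(f"mean toxicity reduction: {sum(reductions) / len(reductions):.1f}%")
print(f"mean BERTScore F1: {f1.mean().item():.3f}")
```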
