Graph-Based Cross-Modal Transformer Framework for Detecting Hate Speech in Social Media Streams

Abstract

The growth of multi-modal content combining text, images, audio, and video makes hate speech identification on social media increasingly challenging. Implicit, symbolic, or context-driven hate speech is difficult for traditional unimodal or early-fusion models to capture. The proposed GXMT-HSD (Graph-aware Cross-Modal Transformer for Hate Speech Detection) architecture addresses this by combining transformer-based reasoning with graph-based representation learning to capture deep semantic linkages across modalities. The framework models each modality as a semantic graph: BERT/RoBERTa encodes text with syntactic dependency parsing, Vision Transformers process images with object-region-context associations, MFCCs represent audio with affective emotion embeddings, and 3D CNNs capture scene transitions in video. A Multi-Head Graph Attention Network (MH-GAT) dynamically aligns these modality-specific graphs, achieving inter-modal semantic fusion by prioritizing cross-modal relevance and structural coherence. Finally, a Hierarchical Transformer Decoder processes the fused embeddings for classification and generates attention-based explanations to improve interpretability. This integrated design is motivated by the need to model latent cross-modal relationships and detect complex hate speech patterns that standalone or pipeline models often overlook. The proposed GXMT-HSD framework achieves 98.5% accuracy and a 97.9% F1-score, ensuring balanced precision and recall in detecting complex hate speech. It demonstrates 97.6% robustness against noisy or adversarial inputs and a 97.1% fidelity score, confirming the reliability of its attention-based explanations. With 94.2% inference efficiency and 94.0% explanation compactness, GXMT-HSD is optimized for real-time, large-scale content moderation while maintaining transparency and trust.
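The fusion step described above can be illustrated with a minimal NumPy sketch of GAT-style multi-head graph attention over a semantic graph. This is an illustrative sketch under standard GAT assumptions, not the paper's implementation: all function names, dimensions, and the LeakyReLU scoring rule are assumptions for demonstration, and a real MH-GAT would operate on separate modality graphs with cross-modal edges and learned parameters.

```python
import numpy as np

def multi_head_graph_attention(X, A, W_heads, a_heads):
    """GAT-style multi-head attention (illustrative sketch).

    X       : (N, F) node features, e.g. fused modality embeddings
    A       : (N, N) adjacency of the semantic graph (nonzero = edge)
    W_heads : list of (F, F') projection matrices, one per head
    a_heads : list of (2*F',) attention vectors, one per head
    Returns (N, H*F') features, heads concatenated.
    """
    outputs = []
    for W, a in zip(W_heads, a_heads):
        H = X @ W                                  # project node features
        N = H.shape[0]
        e = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                # score e_ij = LeakyReLU(a^T [h_i || h_j])
                z = np.concatenate([H[i], H[j]]) @ a
                e[i, j] = z if z > 0 else 0.2 * z
        e = np.where(A > 0, e, -1e9)               # attend only along edges
        alpha = np.exp(e - e.max(axis=1, keepdims=True))
        alpha = alpha / alpha.sum(axis=1, keepdims=True)  # row-wise softmax
        outputs.append(alpha @ H)                  # aggregate neighbours
    return np.concatenate(outputs, axis=1)

# Toy usage: 4 nodes, 3 input features, 2 heads of width 5
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
A = np.eye(4) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)  # path graph
W_heads = [rng.standard_normal((3, 5)) for _ in range(2)]
a_heads = [rng.standard_normal(10) for _ in range(2)]
out = multi_head_graph_attention(X, A, W_heads, a_heads)
print(out.shape)  # (4, 10)
```

Masking the scores with the adjacency matrix before the softmax is what makes this graph attention rather than full self-attention: each node aggregates only from its neighbours, so structural coherence of the semantic graph constrains the fusion.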
