DDCAF: Dynamic Dual Cross-Attention Fusion Framework for Multimodal Hate Speech Detection

Abstract

The rapid spread of offensive and violent speech on social media platforms poses a significant threat to community harmony. In particular, hateful memes are multimodal artifacts that combine images with text to convey implicit or sarcastic hate cues, and they remain difficult to detect because they often bypass traditional unimodal detection methods. To address this problem, we propose DDCAF (Dynamic Dual Cross-Attention Fusion), a novel multimodal framework for hate speech detection that integrates deep semantic understanding from both the visual and textual modalities. It uses a dual-stream architecture consisting of a RoBERTa-based text encoder and a Vision Transformer (ViT)-based image encoder. Through a bidirectional cross-attention mechanism, the model dynamically computes text-guided visual attention and visual-guided text attention, enabling it to prioritize semantically aligned features across modalities. This dynamic, attention-driven fusion mechanism can identify subtle, context-dependent cues of hateful intent. The proposed framework is evaluated primarily on the multimodal benchmark datasets Hateful Memes and MMHS150K, and additionally on the unimodal baselines HateEval and OLID. The experimental findings show that DDCAF surpasses existing approaches, achieving an accuracy of 89.35% on Hateful Memes and 91.20% on MMHS150K. Furthermore, ablation studies demonstrate the contribution of each subcomponent of DDCAF, underscoring the importance of dynamic adaptive gating in capturing intermodal dependencies.
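To make the described architecture concrete, the following is a minimal PyTorch sketch of a dual cross-attention fusion block of the kind the abstract outlines: text features query image features and vice versa, and a learned gate adaptively combines the two attended streams. The class name, hidden dimension, pooling strategy, and gate form are illustrative assumptions, not the authors' implementation; the encoders are stood in for by precomputed feature tensors.

```python
import torch
import torch.nn as nn

class DualCrossAttentionFusion(nn.Module):
    """Hypothetical sketch of a DDCAF-style fusion block: bidirectional
    cross-attention between text and image features, combined by a
    learned adaptive gate. Dimensions and layer choices are assumptions."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Text-guided visual attention: text tokens query image patches.
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Visual-guided text attention: image patches query text tokens.
        self.image_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Dynamic gate weighing the two attended streams per example.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, 2)  # hateful vs. non-hateful

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, dim), e.g. RoBERTa token embeddings
        # image_feats: (B, P, dim), e.g. ViT patch embeddings
        t2i, _ = self.text_to_image(text_feats, image_feats, image_feats)
        i2t, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # Pool each attended sequence to a single vector.
        t2i_vec = t2i.mean(dim=1)
        i2t_vec = i2t.mean(dim=1)
        # Adaptive gating: convex combination of the two streams.
        g = self.gate(torch.cat([t2i_vec, i2t_vec], dim=-1))
        fused = g * t2i_vec + (1 - g) * i2t_vec
        return self.classifier(fused)

# Example usage with random tensors standing in for encoder outputs.
model = DualCrossAttentionFusion()
text = torch.randn(4, 32, 768)    # batch of 4, 32 text tokens
image = torch.randn(4, 197, 768)  # batch of 4, 197 ViT patch tokens
logits = model(text, image)       # shape: (4, 2)
```

The per-example sigmoid gate is one plausible reading of the "dynamic adaptive gating" the abstract highlights; the paper may use a different gating or pooling formulation.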
