DDCAF: Dynamic Dual Cross-Attention Fusion Framework for Multimodal Hate Speech Detection

Abstract

The rapid spread of offensive and violent speech on social media platforms poses a significant threat to community harmony. In particular, hateful memes are multimodal artifacts that combine images with text to convey implicit or sarcastic hate cues, and they remain difficult to detect because they often bypass traditional unimodal detection methods. To address this problem, we propose DDCAF (Dynamic Dual Cross-Attention Fusion), a novel multimodal framework for hate speech detection that integrates deep semantic understanding from both the visual and textual modalities. It uses a dual-stream architecture consisting of a RoBERTa-based text encoder and a Vision Transformer (ViT)-based image encoder. Through a bidirectional cross-attention mechanism, the model dynamically computes text-guided visual attention and visual-guided text attention, enabling it to prioritize semantically aligned features across modalities. This dynamic, attention-driven fusion mechanism can identify subtle, context-dependent cues of hateful intent. The proposed framework is evaluated primarily on the multimodal benchmark datasets Hateful Memes and MMHS150K, and additionally on the unimodal baselines HateEval and OLID. The experimental findings show that DDCAF surpasses existing approaches, achieving an accuracy of 89.35% on Hateful Memes and 91.20% on MMHS150K. Furthermore, ablation studies demonstrate the contribution of each subcomponent of DDCAF, underscoring the importance of dynamic adaptive gating in capturing intermodal dependencies.
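To make the described architecture concrete, the following is a minimal PyTorch sketch of a dual cross-attention fusion block of the kind the abstract outlines: text features query image features and vice versa, and a learned gate adaptively combines the two attended streams. The class name, hidden dimension, pooling strategy, and gate form are illustrative assumptions, not the authors' implementation; the encoders are stood in for by precomputed feature tensors.

```python
import torch
import torch.nn as nn

class DualCrossAttentionFusion(nn.Module):
    """Hypothetical sketch of a DDCAF-style fusion block: bidirectional
    cross-attention between text and image features, combined by a
    learned adaptive gate. Dimensions and layer choices are assumptions."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Text-guided visual attention: text tokens query image patches.
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Visual-guided text attention: image patches query text tokens.
        self.image_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Dynamic gate weighing the two attended streams per example.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, 2)  # hateful vs. non-hateful

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, dim), e.g. RoBERTa token embeddings
        # image_feats: (B, P, dim), e.g. ViT patch embeddings
        t2i, _ = self.text_to_image(text_feats, image_feats, image_feats)
        i2t, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # Pool each attended sequence to a single vector.
        t2i_vec = t2i.mean(dim=1)
        i2t_vec = i2t.mean(dim=1)
        # Adaptive gating: convex combination of the two streams.
        g = self.gate(torch.cat([t2i_vec, i2t_vec], dim=-1))
        fused = g * t2i_vec + (1 - g) * i2t_vec
        return self.classifier(fused)

# Example usage with random tensors standing in for encoder outputs.
model = DualCrossAttentionFusion()
text = torch.randn(4, 32, 768)    # batch of 4, 32 text tokens
image = torch.randn(4, 197, 768)  # batch of 4, 197 ViT patch tokens
logits = model(text, image)       # shape: (4, 2)
```

The per-example sigmoid gate is one plausible reading of the "dynamic adaptive gating" the abstract highlights; the paper may use a different gating or pooling formulation.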
