A Multi-Scale Adaptive Fusion Model for Multimodal Sarcasm Detection

Abstract

This paper proposes a Multi-Scale Adaptive Fusion Sarcasm Detection Model (MSAF-SDM) to address the challenges of information complexity and insufficient inter-modal collaboration in multimodal sarcasm detection. The model integrates multi-level features from text, audio, and video modalities, leveraging a dynamic attention mechanism and an adaptive weight allocation strategy to capture sarcasm-related cues across modalities. To enhance feature extraction capabilities, the text modality employs a dual multi-scale dilated window attention mechanism, the audio modality utilizes multi-scale temporal convolution, and the video modality incorporates multi-scale spatiotemporal convolution reinforced by auxiliary modal features. Experimental results demonstrate that MSAF-SDM achieves an accuracy of 89.04% and an F1-score of 87.68% on public datasets, significantly outperforming existing state-of-the-art models. Ablation studies further validate the effectiveness of the multimodal feature extraction and adaptive fusion mechanisms. This research provides a novel approach for tackling multimodal sarcasm detection tasks.
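The abstract does not give implementation details, but the adaptive weight allocation idea it describes — scoring each modality's features and fusing them by softmax-normalized weights — can be illustrated with a minimal NumPy sketch. All names, dimensions, and the scoring scheme below are illustrative assumptions, not the authors' actual architecture:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_fusion(text_feat, audio_feat, video_feat, score_w):
    """Hypothetical adaptive weight allocation: score each modality
    embedding against a learned vector, softmax the scores, and
    return the weighted sum of the modality features."""
    feats = np.stack([text_feat, audio_feat, video_feat])  # (3, d)
    scores = feats @ score_w                               # (3,)
    weights = softmax(scores)                              # sums to 1
    fused = weights @ feats                                # (d,)
    return fused, weights

# Toy example with random 8-dimensional modality embeddings
rng = np.random.default_rng(0)
d = 8
fused, weights = adaptive_fusion(
    rng.normal(size=d),   # stand-in for text features
    rng.normal(size=d),   # stand-in for audio features
    rng.normal(size=d),   # stand-in for video features
    rng.normal(size=d),   # stand-in for a learned scoring vector
)
```

In the actual model the per-modality features would come from the multi-scale extractors described above (dilated window attention for text, temporal convolution for audio, spatiotemporal convolution for video), and the scoring would be a trained attention module rather than a fixed vector.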
