A Multi-Scale Adaptive Fusion Model for Multimodal Sarcasm Detection
Abstract
This paper proposes a Multi-Scale Adaptive Fusion Sarcasm Detection Model (MSAF-SDM) to address the challenges of information complexity and insufficient inter-modal collaboration in multimodal sarcasm detection. The model integrates multi-level features from text, audio, and video modalities, leveraging a dynamic attention mechanism and an adaptive weight allocation strategy to capture sarcasm-related cues across modalities. To enhance feature extraction capabilities, the text modality employs a dual multi-scale dilated window attention mechanism, the audio modality utilizes multi-scale temporal convolution, and the video modality incorporates multi-scale spatiotemporal convolution reinforced by auxiliary modal features. Experimental results demonstrate that MSAF-SDM achieves an accuracy of 89.04% and an F1-score of 87.68% on public datasets, significantly outperforming existing state-of-the-art models. Ablation studies further validate the effectiveness of the multimodal feature extraction and adaptive fusion mechanisms. This research provides a novel approach for tackling multimodal sarcasm detection tasks.
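The adaptive weight allocation described above can be illustrated with a minimal sketch: per-modality relevance scores are turned into softmax weights that scale each modality's feature vector before fusion. This is an assumption-laden illustration, not the paper's actual implementation; the function names (`adaptive_fusion`) and the scoring-vector parameterization (`score_w`) are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_fusion(text_feat, audio_feat, video_feat, score_w):
    """Fuse three modality feature vectors with adaptive softmax weights.

    text_feat, audio_feat, video_feat: (d,) feature vectors.
    score_w: (3, d) hypothetical learned scoring vectors, one per modality.
    Returns (weights, fused) where weights sum to 1 and fused is (d,).
    """
    feats = np.stack([text_feat, audio_feat, video_feat])   # (3, d)
    scores = np.einsum('md,md->m', score_w, feats)          # per-modality relevance
    weights = softmax(scores)                               # adaptive allocation
    fused = np.einsum('m,md->d', weights, feats)            # weighted sum of modalities
    return weights, fused

# Toy usage with random features of dimension 8.
rng = np.random.default_rng(0)
t, a, v = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
w = rng.normal(size=(3, 8))
weights, fused = adaptive_fusion(t, a, v, w)
```

In the full model the scores would come from a learned dynamic attention module rather than fixed vectors, so the weights shift per example, letting whichever modality carries the sarcasm cue dominate the fused representation.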