Adaptive Latent Interaction Reasoning for Multimodal Misinformation Analysis
Abstract
The rapid growth of online social platforms has fundamentally transformed how information is produced, disseminated, and consumed, while simultaneously amplifying the societal impact of misleading and fabricated content. In response to this challenge, multimodal fake news detection has emerged as a critical research problem, aiming to jointly leverage the textual and visual signals embedded in social media posts. Existing methods predominantly rely on direct fusion of unimodal representations or shallow cross-modal interactions, which often fail to explicitly model the semantic alignment and latent inconsistencies across modalities. In particular, the potential of contrastive learning paradigms for learning robust and semantically grounded multimodal representations in fake news scenarios remains underexplored. In this work, we introduce ALIGNER, an Adaptive Latent Interaction Guided coNtrastivE Reasoning framework designed for multimodal fake news detection. ALIGNER adopts a dual-encoder architecture to learn modality-specific semantic representations and employs cross-modal contrastive learning to explicitly align visual and textual semantics. To address the inherent noise and ambiguity of image–text associations in real-world fake news data, we further propose a latent consistency objective that relaxes the rigid one-hot supervision imposed by conventional contrastive losses. This auxiliary learning signal enables the model to capture fine-grained semantic relatedness among unpaired or weakly related multimodal samples. Building upon the aligned unimodal features, ALIGNER incorporates a dedicated cross-modal interaction module to capture higher-order correlations between visual and linguistic representations. Moreover, we design an attention-based aggregation mechanism equipped with an explicit guidance signal to adaptively weigh the contributions of different modalities during decision making, thereby enhancing both effectiveness and interpretability. Extensive experiments conducted on two widely adopted benchmarks, Twitter and Weibo, demonstrate that ALIGNER consistently surpasses existing state-of-the-art approaches by a substantial margin, highlighting the advantages of adaptive contrastive reasoning for multimodal fake news detection.
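To make the core idea of relaxing one-hot contrastive supervision concrete, the sketch below shows one plausible way such an objective could be implemented in PyTorch: the usual image-to-text contrastive targets are blended with soft targets derived from intra-modal similarities, so weakly related image-text pairs still receive some probability mass. All names (`relaxed_contrastive_loss`), dimensions, the temperature, and the mixing weight `alpha` are illustrative assumptions for exposition, not the authors' released code or exact formulation.

```python
# Minimal sketch of cross-modal contrastive alignment with relaxed (soft) targets.
# Module names, dimensions, temperature, and `alpha` are illustrative assumptions,
# not taken from the ALIGNER implementation.
import torch
import torch.nn.functional as F

def relaxed_contrastive_loss(img_emb, txt_emb, temperature=0.07, alpha=0.2):
    """Image-text contrastive loss whose one-hot targets are softened by
    intra-modal similarity, so nominally unpaired but semantically related
    samples still receive a small amount of probability mass."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Cross-modal similarity logits (batch_size x batch_size).
    logits = img_emb @ txt_emb.t() / temperature

    # Conventional one-hot targets: the i-th image matches the i-th text.
    hard_targets = torch.eye(img_emb.size(0), device=img_emb.device)

    # Soft targets from intra-modal similarities, capturing latent relatedness
    # among samples that are not annotated as pairs.
    soft_img = F.softmax(img_emb @ img_emb.t() / temperature, dim=-1)
    soft_txt = F.softmax(txt_emb @ txt_emb.t() / temperature, dim=-1)

    targets_i2t = (1 - alpha) * hard_targets + alpha * soft_txt
    targets_t2i = (1 - alpha) * hard_targets + alpha * soft_img

    loss_i2t = -(targets_i2t * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_t2i = -(targets_t2i * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    img = torch.randn(8, 256)  # dummy image embeddings from a visual encoder
    txt = torch.randn(8, 256)  # dummy text embeddings from a textual encoder
    print(relaxed_contrastive_loss(img, txt))
```

With `alpha = 0`, this reduces to a standard symmetric InfoNCE-style loss with one-hot targets; increasing `alpha` progressively shifts supervision toward the soft, similarity-based targets that the abstract motivates for noisy image-text associations.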