Non-Salient Visual Content Grounding for Multimodal Relation Extraction
Abstract
Multimodal Relation Extraction (MRE) seeks to identify relations between textual entities with the aid of visual context. However, current models struggle when visual information is non-salient or only weakly relevant, primarily due to two factors: (1) independent visual feature extraction tends to prioritize biased semantics of salient visual content, hindering attention to non-salient details; and (2) classifier-based methods struggle to filter out spurious multimodal correlations, particularly in scenarios involving non-salient visual cues. To address these issues, we propose a Non-Salient Visual Content Grounding (NSVCG) Network that leverages instruction-following Multimodal Large Language Models (MLLMs) with entity-guided prompts to extract relevant but non-salient visual features. Furthermore, a conditional diffusion mechanism iteratively refines predictions by eliminating spurious multimodal correlations. As a test bed, we introduce a new dataset, VG-MNRE, which extends the MNRE test set with 1,614 samples and manually annotated grounding labels. Experimental results show that NSVCG outperforms state-of-the-art baselines by 1.2% F1 on MNRE and by 6.76% grounding accuracy on VG-MNRE, demonstrating improved robustness and a stronger ability to ground relevant visual content. The code will be released upon publication.
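To illustrate the entity-guided prompting idea at a high level, the sketch below is a minimal example and not the authors' implementation; the names `build_entity_guided_prompt` and `query_mllm` are hypothetical placeholders. It only shows one plausible way to phrase a prompt that steers an instruction-following MLLM toward image content tied to the head and tail entities, including non-salient regions.

```python
# Minimal sketch (not the paper's code): entity-guided prompting of an MLLM
# so that it describes image content related to the two entities, even when
# that content is small or not the main subject of the image.

def build_entity_guided_prompt(sentence: str, head: str, tail: str) -> str:
    """Compose the textual half of a multimodal query for an MLLM."""
    return (
        "You are given an image and a sentence.\n"
        f"Sentence: {sentence}\n"
        f"Entities: head = '{head}', tail = '{tail}'.\n"
        "Describe any visual content related to these entities, including "
        "small, background, or otherwise non-salient regions. "
        "If nothing relevant is visible, answer 'no relevant content'."
    )


def query_mllm(image_path: str, prompt: str) -> str:
    # Hypothetical placeholder: in practice this would call an
    # instruction-following MLLM (e.g., via a local model or API client).
    raise NotImplementedError


if __name__ == "__main__":
    prompt = build_entity_guided_prompt(
        sentence="The band played at the stadium last night.",
        head="the band",
        tail="the stadium",
    )
    print(prompt)  # text portion of the multimodal query
```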