Cross-Modal Local Interest Contrast with Dual-Graph Denoising for Multimodal Recommendation
Abstract
Multimodal recommender systems improve recommendation accuracy by incorporating item multimodal features (e.g., text, images) alongside user-item interactions. However, they face two critical challenges: (1) local user interests in multimodal features are often obscured by irrelevant content (e.g., background clutter in product images), and (2) behavioral data contains pervasive low-credibility interactions (e.g., accidental clicks) that propagate noise through graph-based recommenders. Notably, over-reliance on region-of-interest (ROI) features during graph construction may introduce spurious edges by ignoring global contextual relationships, exacerbating semantic distortion. To address these issues, we propose CLID (Cross-modal Local Interest Denoising), a novel framework integrating Cross-Modal Local Interest Contrast with Dual-Graph Denoising. First, our local interest contrast mechanism employs text-guided visual attention alignment and a contrastive loss to enhance discriminative local features; for example, it learns to focus on the "sleeve design" in clothing images while suppressing unrelated background content. Crucially, it adaptively weights local features against global representations to prevent ROI-induced bias. Second, the dual-graph denoising architecture combines (i) a local graph that stabilizes neighbor aggregation via structural consistency to attenuate noisy interactions and (ii) a hypergraph that captures group-wise behavioral patterns, reinforcing high-confidence interactions through co-occurrence frequency weighting. Experiments on three Amazon review datasets (Baby; Clothing, Shoes and Jewelry; and Sports and Outdoors) demonstrate that CLID significantly improves recommendation performance. The proposed framework provides a generalizable contrast-and-denoise paradigm for robust multimodal recommendation, bridging fine-grained feature enhancement with noise-resilient graph learning. The code is openly available at https://github.com/Qiyx5025/CLID-master.
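To make the local interest contrast mechanism concrete, below is a minimal PyTorch sketch of the idea as the abstract describes it: text features guide attention over visual region features, an InfoNCE-style contrastive loss aligns matched text-visual pairs against in-batch negatives, and a learned gate adaptively weights the attended local feature against the global visual representation. All module and variable names (LocalInterestContrast, query_proj, gate, etc.) are illustrative assumptions, not identifiers from the CLID repository.

```python
# A hedged sketch of text-guided local interest contrast, assuming a PyTorch
# setup. Names and hyperparameters are hypothetical, not from the CLID code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalInterestContrast(nn.Module):
    def __init__(self, dim: int, temperature: float = 0.2):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # project text into an attention query
        self.key_proj = nn.Linear(dim, dim)    # project visual regions into keys
        self.gate = nn.Linear(2 * dim, 1)      # adaptive local-vs-global weighting
        self.temperature = temperature

    def forward(self, text_feat, region_feats, global_visual):
        # text_feat:     (B, D)    item text embedding
        # region_feats:  (B, R, D) visual region (ROI-like) embeddings
        # global_visual: (B, D)    global image embedding
        q = self.query_proj(text_feat).unsqueeze(1)                        # (B, 1, D)
        k = self.key_proj(region_feats)                                    # (B, R, D)
        attn = torch.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)  # (B, R)
        local = (attn.unsqueeze(-1) * region_feats).sum(1)                 # (B, D)

        # Adaptive fusion guards against over-reliance on ROI features alone.
        alpha = torch.sigmoid(self.gate(torch.cat([local, global_visual], dim=-1)))
        visual = alpha * local + (1 - alpha) * global_visual               # (B, D)

        # InfoNCE-style contrast: matched text/visual pairs are positives,
        # other items in the batch serve as negatives.
        t = F.normalize(text_feat, dim=-1)
        v = F.normalize(visual, dim=-1)
        logits = t @ v.t() / self.temperature                              # (B, B)
        labels = torch.arange(t.size(0), device=t.device)
        loss = F.cross_entropy(logits, labels)
        return visual, loss
```

The gate is the key design choice here: rather than committing fully to attended ROI features, the sigmoid-weighted mixture keeps global context in the representation, which matches the abstract's point about preventing ROI-induced bias.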
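The dual-graph denoising side can be sketched in a similar hedged fashion: co-occurrence frequency weighting reinforces high-confidence connections for the hypergraph, and a Jaccard-style overlap of neighborhoods is one plausible structural-consistency signal for attenuating unstable local-graph edges. The function names and the log scaling below are assumptions for illustration; CLID's exact formulation may differ.

```python
# A minimal sketch of co-occurrence weighting and a structural-consistency
# score, assuming a SciPy sparse (users x items) 0/1 interaction matrix.
import numpy as np
import scipy.sparse as sp

def cooccurrence_weights(interactions: sp.csr_matrix) -> sp.csr_matrix:
    """Item-item co-occurrence counts, log-scaled as edge weights."""
    co = (interactions.T @ interactions).tocsr()  # (items x items) counts
    co.setdiag(0)                                 # drop self-co-occurrence
    co.eliminate_zeros()
    # Log-scale raw counts so very frequent pairs do not dominate abruptly.
    co.data = np.log1p(co.data)
    return co

def structural_consistency(interactions: sp.csr_matrix, u: int, v: int) -> float:
    """Jaccard overlap of two users' item sets; low values flag noisy edges."""
    a = set(interactions[u].indices)
    b = set(interactions[v].indices)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Usage on a toy 3-user x 4-item interaction matrix:
R = sp.csr_matrix(np.array([[1, 1, 0, 0],
                            [1, 1, 1, 0],
                            [0, 0, 1, 1]]))
print(cooccurrence_weights(R).toarray())
print(structural_consistency(R, 0, 1))  # users 0 and 1 share two items
```

In a graph recommender, such weights would typically rescale the adjacency used for message passing, so that high-confidence (frequently co-occurring) interactions contribute more to neighbor aggregation while isolated, low-consistency edges, such as accidental clicks, are attenuated.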