Cross-Modal Local Interest Contrast with Dual-Graph Denoising for Multimodal Recommendation
Abstract
Multimodal recommender systems improve recommendation accuracy by incorporating item multimodal features (e.g., text, images) alongside user-item interactions. However, they face two critical challenges: (1) local user interests in multimodal features are often obscured by irrelevant content (e.g., background clutter in product images), and (2) behavioral data contains pervasive low-credibility interactions (e.g., accidental clicks) that propagate noise through graph-based recommenders. Notably, over-reliance on region-of-interest (ROI) features during graph construction may introduce spurious edges by ignoring global contextual relationships, exacerbating semantic distortion. To address these issues, we propose CLID (Cross-modal Local Interest Denoising), a novel framework integrating Cross-Modal Local Interest Contrast with Dual-Graph Denoising. First, our local interest contrast mechanism employs text-guided visual attention alignment and a contrastive loss to enhance discriminative local features; for example, it learns to focus on the "sleeve design" in clothing images while suppressing unrelated background content. Crucially, it adaptively weights local features against global representations to prevent ROI-induced bias. Second, the dual-graph denoising architecture combines (i) a local graph that stabilizes neighbor aggregation via structural consistency to attenuate noisy interactions and (ii) a hypergraph that captures group-wise behavioral patterns, reinforcing high-confidence interactions through co-occurrence frequency weighting. Experiments on three Amazon review datasets (Baby; Clothing, Shoes and Jewelry; and Sports and Outdoors) demonstrate that CLID significantly improves recommendation performance. The proposed framework provides a generalizable contrast-and-denoise paradigm for robust multimodal recommendation, bridging fine-grained feature enhancement with noise-resilient graph learning. The code is openly available at https://github.com/Qiyx5025/CLID-master.
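To make the local interest contrast mechanism concrete, below is a minimal PyTorch sketch of the idea as the abstract describes it: text features guide attention over visual region features, an InfoNCE-style contrastive loss aligns matched text-visual pairs against in-batch negatives, and a learned gate adaptively weights the attended local feature against the global visual representation. All module and variable names (LocalInterestContrast, query_proj, gate, etc.) are illustrative assumptions, not identifiers from the CLID repository.

```python
# A hedged sketch of text-guided local interest contrast, assuming a PyTorch
# setup. Names and hyperparameters are hypothetical, not from the CLID code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalInterestContrast(nn.Module):
    def __init__(self, dim: int, temperature: float = 0.2):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # project text into an attention query
        self.key_proj = nn.Linear(dim, dim)    # project visual regions into keys
        self.gate = nn.Linear(2 * dim, 1)      # adaptive local-vs-global weighting
        self.temperature = temperature

    def forward(self, text_feat, region_feats, global_visual):
        # text_feat:     (B, D)    item text embedding
        # region_feats:  (B, R, D) visual region (ROI-like) embeddings
        # global_visual: (B, D)    global image embedding
        q = self.query_proj(text_feat).unsqueeze(1)                        # (B, 1, D)
        k = self.key_proj(region_feats)                                    # (B, R, D)
        attn = torch.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)  # (B, R)
        local = (attn.unsqueeze(-1) * region_feats).sum(1)                 # (B, D)

        # Adaptive fusion guards against over-reliance on ROI features alone.
        alpha = torch.sigmoid(self.gate(torch.cat([local, global_visual], dim=-1)))
        visual = alpha * local + (1 - alpha) * global_visual               # (B, D)

        # InfoNCE-style contrast: matched text/visual pairs are positives,
        # other items in the batch serve as negatives.
        t = F.normalize(text_feat, dim=-1)
        v = F.normalize(visual, dim=-1)
        logits = t @ v.t() / self.temperature                              # (B, B)
        labels = torch.arange(t.size(0), device=t.device)
        loss = F.cross_entropy(logits, labels)
        return visual, loss
```

The gate is the key design choice here: rather than committing fully to attended ROI features, the sigmoid-weighted mixture keeps global context in the representation, which matches the abstract's point about preventing ROI-induced bias.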
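The dual-graph denoising side can be sketched in a similar hedged fashion: co-occurrence frequency weighting reinforces high-confidence connections for the hypergraph, and a Jaccard-style overlap of neighborhoods is one plausible structural-consistency signal for attenuating unstable local-graph edges. The function names and the log scaling below are assumptions for illustration; CLID's exact formulation may differ.

```python
# A minimal sketch of co-occurrence weighting and a structural-consistency
# score, assuming a SciPy sparse (users x items) 0/1 interaction matrix.
import numpy as np
import scipy.sparse as sp

def cooccurrence_weights(interactions: sp.csr_matrix) -> sp.csr_matrix:
    """Item-item co-occurrence counts, log-scaled as edge weights."""
    co = (interactions.T @ interactions).tocsr()  # (items x items) counts
    co.setdiag(0)                                 # drop self-co-occurrence
    co.eliminate_zeros()
    # Log-scale raw counts so very frequent pairs do not dominate abruptly.
    co.data = np.log1p(co.data)
    return co

def structural_consistency(interactions: sp.csr_matrix, u: int, v: int) -> float:
    """Jaccard overlap of two users' item sets; low values flag noisy edges."""
    a = set(interactions[u].indices)
    b = set(interactions[v].indices)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Usage on a toy 3-user x 4-item interaction matrix:
R = sp.csr_matrix(np.array([[1, 1, 0, 0],
                            [1, 1, 1, 0],
                            [0, 0, 1, 1]]))
print(cooccurrence_weights(R).toarray())
print(structural_consistency(R, 0, 1))  # users 0 and 1 share two items
```

In a graph recommender, such weights would typically rescale the adjacency used for message passing, so that high-confidence (frequently co-occurring) interactions contribute more to neighbor aggregation while isolated, low-consistency edges, such as accidental clicks, are attenuated.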