Multimodal Denoising Recommendation Based on Confidence and Hierarchical Cross-Modal Alignment
Abstract
In multimodal recommendation systems, modal conflict and noise interference significantly degrade model performance. Aligning modalities with full-attention mechanisms partially mitigates modal conflict, but it incurs high computational complexity and disregards semantic hierarchies. Conversely, conventional contrastive learning, though effective at suppressing noise, often lacks the discriminative power to separate residual noise from semantically meaningful features during denoising. To address these limitations, we propose Multimodal Denoising Recommendation Based on Confidence and Hierarchical Cross-Modal Alignment (MDR-CHCA). The model incorporates a hierarchical cross-modal alignment module that reduces computational complexity and produces fine-grained aligned features through a two-stage process: global alignment between phrases and image regions, followed by fine-grained alignment between words and image regions. Furthermore, we introduce a confidence-weighted contrastive loss that dynamically selects high-quality positive and negative pairs, enhancing the model’s robustness against noise and its discriminative capability. Extensive experiments on three public datasets (Baby, Sports, and Clothing) validate the effectiveness and superiority of the proposed approach.
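To make the two-stage alignment concrete, the following is a minimal sketch in PyTorch. It assumes standard cross-attention (phrases and words as queries, image regions as keys/values) with mean-pooled fusion; the class name, dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HierarchicalCrossModalAlignment(nn.Module):
    """Illustrative two-stage alignment: coarse (phrase-region),
    then fine-grained (word-region), fused into one representation."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Stage 1: global alignment -- phrases attend over image regions.
        self.phrase_region_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stage 2: fine-grained alignment -- words attend over image regions.
        self.word_region_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, phrases, words, regions):
        # phrases: (B, P, D), words: (B, W, D), regions: (B, R, D)
        coarse, _ = self.phrase_region_attn(phrases, regions, regions)
        fine, _ = self.word_region_attn(words, regions, regions)
        # Pool each granularity and fuse into a single aligned feature.
        pooled = torch.cat([coarse.mean(dim=1), fine.mean(dim=1)], dim=-1)
        return self.fuse(pooled)  # (B, D)
```

Restricting attention to phrase-region and word-region pairs, rather than full token-to-token attention over all modalities, is one way such a hierarchy can keep the cost below that of a full-attention alignment.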
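Similarly, a hedged sketch of a confidence-weighted contrastive loss is shown below: an InfoNCE-style objective in which each positive pair's contribution is scaled by a confidence score. Here the confidence is derived from the softmax probability of the true pair; the paper's actual weighting and pair-selection scheme may differ.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_contrastive_loss(z_a, z_b, temperature: float = 0.1):
    # z_a, z_b: (B, D) paired embeddings from two modalities.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    per_pair = F.cross_entropy(logits, labels, reduction="none")
    # Confidence of each positive pair: softmax probability of the true
    # match, detached so it weights the loss without opening a gradient
    # path (an illustrative choice, not the paper's exact formulation).
    with torch.no_grad():
        conf = logits.softmax(dim=-1).diagonal()
    weights = conf / conf.sum().clamp_min(1e-8)
    return (weights * per_pair).sum()
```

Down-weighting low-confidence pairs in this way approximates dynamically selecting high-quality positive and negative pairs: unreliable pairs contribute little gradient, so residual noise has less influence on the learned features.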