Multimodal Denoising Recommendation Based on Confidence and Hierarchical Cross-Modal Alignment

Abstract

In multimodal recommendation systems, modal conflict and noise interference significantly degrade model performance. Aligning modalities with full-attention mechanisms partially mitigates modal conflict, but it incurs high computational complexity and disregards semantic hierarchies. Conventional contrastive learning, though effective at suppressing noise, often lacks the discriminative power to distinguish residual noise from semantically meaningful features during denoising. To address these limitations, we propose Multimodal Denoising Recommendation Based on Confidence and Hierarchical Cross-Modal Alignment (MDR-CHCA). The model introduces a hierarchical cross-modal alignment module that reduces computational complexity and generates fine-grained aligned features through a two-stage process: global alignment between phrases and image regions, followed by fine-grained alignment between words and image regions. We further introduce a confidence-weighted contrastive loss that dynamically selects high-quality positive and negative pairs, thereby enhancing the model's robustness to noise and its discriminative capability. Extensive experiments on three public datasets (Baby, Sports, and Clothing) validate the effectiveness and superiority of the proposed approach.
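
The two components named in the abstract can be illustrated with short sketches. The first is a minimal PyTorch sketch of one plausible form of the two-stage hierarchical alignment: a global phrase-to-region cross-attention pass selects the most relevant image regions, and a fine-grained word-level pass attends only to that subset, which is how a hierarchical scheme can cost less than full word-by-region attention. All names here (HierarchicalCrossModalAlignment, top_k, etc.) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HierarchicalCrossModalAlignment(nn.Module):
    """Hypothetical two-stage alignment sketch.

    Stage 1 (global): phrase embeddings attend to image-region embeddings
    in a single cross-attention pass, yielding coarsely aligned features.
    Stage 2 (fine-grained): word embeddings attend only to the regions the
    global stage found most relevant, so the fine pass scales with top_k
    rather than with the full number of regions.
    """
    def __init__(self, dim=256, heads=4, top_k=8):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fine_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.top_k = top_k

    def forward(self, phrases, words, regions):
        # phrases: (B, P, D), words: (B, W, D), regions: (B, R, D)
        # Stage 1: global phrase-region alignment; g_w has shape (B, P, R).
        g, g_w = self.global_attn(phrases, regions, regions)
        # Rank regions by their average attention across phrases, keep top-k.
        scores = g_w.mean(dim=1)                              # (B, R)
        k = min(self.top_k, regions.size(1))
        idx = scores.topk(k, dim=-1).indices                  # (B, k)
        sel = torch.gather(
            regions, 1, idx.unsqueeze(-1).expand(-1, -1, regions.size(-1)))
        # Stage 2: fine-grained word alignment over the selected regions only.
        f, _ = self.fine_attn(words, sel, sel)                # (B, W, D)
        return g, f
```

The second sketch shows one way a confidence-weighted contrastive loss could look: an in-batch InfoNCE objective whose positive pairs are re-weighted by a bidirectional-agreement confidence score and whose likely false negatives are masked out. The confidence definition and the margin-based negative filter are assumptions standing in for the paper's pair-selection rule.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_infonce(text_emb, image_emb, temperature=0.07, margin=0.1):
    """Hypothetical confidence-weighted InfoNCE over an in-batch similarity matrix.

    Pair (i, i) is the positive; (i, j), j != i are negatives. Each positive
    is weighted by a confidence score, and negatives nearly as similar as the
    positive are masked as probable false negatives.
    """
    # L2-normalise so dot products are cosine similarities.
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    sims = t @ v.T                                   # (B, B) cosine similarities
    logits = sims / temperature

    # Confidence of each positive: geometric mean of the softmax agreement
    # in both retrieval directions (text->image and image->text).
    p_t2v = logits.softmax(dim=1).diagonal()
    p_v2t = logits.softmax(dim=0).diagonal()
    confidence = (p_t2v * p_v2t).sqrt().detach()     # no gradient through weights

    # Mask off-diagonal entries within `margin` of the positive similarity:
    # such pairs are likely semantic duplicates, not true negatives.
    pos = sims.diagonal().unsqueeze(1)
    eye = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    false_neg = (sims > pos - margin) & ~eye
    logits = logits.masked_fill(false_neg, float('-inf'))

    labels = torch.arange(logits.size(0), device=logits.device)
    per_pair = F.cross_entropy(logits, labels, reduction='none')
    # Down-weight low-confidence positives before averaging.
    return (confidence * per_pair).sum() / confidence.sum().clamp_min(1e-8)
```

In training, such losses would typically be combined with the recommendation objective; the sketches above isolate only the alignment and denoising mechanics described in the abstract.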
