Enhancing Multimodal Recommendation via Contrastive Self-Supervised Modality-Preserving Learning
Abstract
Multimodal recommendation systems have gained increasing attention for their ability to incorporate rich side information such as visual and textual features. However, a critical yet underexplored challenge is the insufficient preservation of modality-specific information during training, which can weaken the effectiveness of multimodal signals and limit recommendation accuracy. To address this limitation, we propose Contrastive Modality-Preserving Learning (CMPL), a novel framework that extends the state-of-the-art MONET architecture. CMPL introduces a before-and-after contrastive learning module that explicitly maximizes the mutual information between initial modality embeddings and their final representations, thereby ensuring stronger modality preservation. At the same time, a graph convolutional backbone captures high-order collaborative signals from the user–item interaction graph, while a target-aware attention mechanism adaptively emphasizes user preference patterns. This joint design allows CMPL to balance the preservation of modality cues with the exploitation of collaborative filtering signals. We conduct extensive experiments on two real-world Amazon datasets, Office and MenClothing, and the results consistently show that CMPL outperforms competitive baselines, including MARIO and MONET, in terms of precision and recall. These findings validate the effectiveness of our approach and further highlight the necessity of explicitly modeling modality preservation for robust multimodal recommendation.
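To make the before-and-after contrastive objective concrete, the sketch below shows one common way such a modality-preserving term is implemented: an in-batch InfoNCE loss between each item's initial modality embedding and its final representation, which lower-bounds their mutual information. This is an illustrative assumption rather than the paper's released code; the function name `modality_preserving_loss`, the cosine-similarity scoring, and the `temperature` value are all hypothetical choices.

```python
# Minimal sketch of a before-and-after contrastive (InfoNCE) objective,
# assuming in-batch negatives and cosine similarity. Not the authors' code;
# names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F


def modality_preserving_loss(initial_emb: torch.Tensor,
                             final_emb: torch.Tensor,
                             temperature: float = 0.2) -> torch.Tensor:
    """InfoNCE-style lower bound on the mutual information between initial
    modality embeddings (before propagation) and final representations
    (after the GCN backbone / attention), for the same batch of items.

    initial_emb, final_emb: tensors of shape (batch_size, dim).
    """
    # Normalize so the dot product becomes cosine similarity.
    z_before = F.normalize(initial_emb, dim=-1)
    z_after = F.normalize(final_emb, dim=-1)

    # (B, B) similarity matrix; the diagonal holds positive pairs,
    # all other entries act as in-batch negatives.
    logits = z_before @ z_after.t() / temperature
    labels = torch.arange(z_before.size(0), device=z_before.device)

    # Symmetric InfoNCE: score matches in both directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    # Example usage with random stand-ins for, e.g., visual item embeddings
    # before and after graph propagation.
    before = torch.randn(256, 64)
    after = torch.randn(256, 64)
    print(modality_preserving_loss(before, after).item())
```

In practice such a term would be added to the recommendation loss (e.g., BPR) with a weighting coefficient, so that modality preservation and collaborative filtering signals are optimized jointly.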