Identifying Common Semantics Across Modalities via Contrastive Latent Alignment

Abstract

Recent advances in multimodal representation learning have been driven largely by contrastive learning strategies, especially in weakly supervised, cross-modal settings such as image-text retrieval and audiovisual reasoning. While successful systems such as CLIP illustrate the practical effectiveness of contrastive objectives, the theory of what such methods can provably recover remains incomplete. Prior studies focus primarily on multi-view settings that assume a uniform generative process across modalities. In this work, we broaden this scope by studying identifiability in general heterogeneous multimodal settings, where each modality follows its own generative dynamics and encodes distinct, modality-specific latent variables. We introduce a new framework, termed CIPHER (Contrastive Identification of Paired Heterogeneous Encodings via Reconstruction), which extends previous identifiability analyses by modeling the multimodal generative process with distinct latent variables for each modality, each transformed through its own nonlinear mixing function. Our theoretical results establish that, under relatively mild conditions, contrastive objectives can still block-identify the shared latent semantics, even when the latent variables exhibit strong statistical dependencies. Crucially, these identifiability guarantees hold in the presence of modality-specific noise and across structurally divergent generative mechanisms. We empirically validate our theoretical findings on synthetic simulations and on real-world paired image-text datasets. The results underscore the robustness and applicability of contrastive learning under complex multimodal generative models. Overall, our work offers a principled explanation for the success of contrastive paradigms in multimodal scenarios and deepens the theoretical foundation underpinning modern multimodal learning.
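As a minimal, hypothetical sketch of the setting the abstract describes (not the paper's released code), the following simulates two modalities that share a common latent block, mixes each through its own nonlinear function, trains encoders with a CLIP-style symmetric InfoNCE loss, and then probes whether the shared block is recoverable from the learned representation. All dimensions, architectures, and the linear-probe R^2 criterion are illustrative assumptions, chosen as a crude stand-in for the block-identifiability evaluations used in this literature.

    # Hypothetical sketch: contrastive block-identification of shared latents
    # in a heterogeneous two-modality generative model.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    D_SHARED, D_SPEC, D_OBS, N = 4, 2, 16, 4096

    def random_mlp(d_in, d_out):
        # Small nonlinear map; stands in for a modality's mixing function.
        return nn.Sequential(nn.Linear(d_in, 32), nn.Tanh(), nn.Linear(32, d_out))

    with torch.no_grad():
        f1 = random_mlp(D_SHARED + D_SPEC, D_OBS)       # modality-1 generator
        f2 = random_mlp(D_SHARED + D_SPEC, D_OBS)       # modality-2 generator
        z_shared = torch.randn(N, D_SHARED)             # semantics common to both modalities
        m1 = torch.randn(N, D_SPEC)                     # modality-specific latents
        m2 = torch.randn(N, D_SPEC)
        x1 = f1(torch.cat([z_shared, m1], dim=1))       # "image" observations
        x2 = f2(torch.cat([z_shared, m2], dim=1))       # "text"  observations

    enc1, enc2 = random_mlp(D_OBS, D_SHARED), random_mlp(D_OBS, D_SHARED)
    opt = torch.optim.Adam(list(enc1.parameters()) + list(enc2.parameters()), lr=1e-3)

    def info_nce(h1, h2, tau=0.1):
        # Symmetric InfoNCE over in-batch negatives (CLIP-style pairing loss).
        h1 = nn.functional.normalize(h1, dim=1)
        h2 = nn.functional.normalize(h2, dim=1)
        logits = h1 @ h2.t() / tau
        labels = torch.arange(h1.size(0))
        return 0.5 * (nn.functional.cross_entropy(logits, labels)
                      + nn.functional.cross_entropy(logits.t(), labels))

    for step in range(2000):
        idx = torch.randint(0, N, (256,))
        loss = info_nce(enc1(x1[idx]), enc2(x2[idx]))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Block-identifiability check: regress the ground-truth shared latents on the
    # learned representation; a high R^2 suggests the shared block was recovered
    # up to an (here, linear) transformation.
    with torch.no_grad():
        h = enc1(x1)
    h = torch.cat([h, torch.ones(N, 1)], dim=1)         # add a bias column
    w = torch.linalg.lstsq(h, z_shared).solution
    r2 = 1 - ((h @ w - z_shared).var() / z_shared.var())
    print(f"linear-probe R^2 to shared latent: {r2:.3f}")

Note that the contrastive objective only ever sees paired observations (x1, x2); the modality-specific latents m1 and m2 act as nuisance variables, which is the regime where the abstract's block-identifiability claim applies.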