Identifying Common Semantics Across Modalities via Contrastive Latent Alignment

Abstract

Recent advances in multimodal representation learning have been driven largely by contrastive learning strategies, especially in weakly supervised, cross-modal settings such as image-text retrieval and audiovisual reasoning. While successful systems such as CLIP illustrate the practical effectiveness of contrastive objectives, the theory of what such methods can provably recover remains incomplete. Prior studies focus primarily on multi-view settings that assume a uniform generative process across modalities. In this work, we broaden this scope by studying identifiability in general heterogeneous multimodal settings, where each modality follows its own generative dynamics and encodes distinct, modality-specific latent variables. We introduce a new framework, termed CIPHER (Contrastive Identification of Paired Heterogeneous Encodings via Reconstruction), which extends previous identifiability analyses by modeling the multimodal generative process with distinct latent variables for each modality, each transformed through its own nonlinear mixing function. Our theoretical results establish that, under relatively mild conditions, contrastive objectives can still block-identify the shared latent semantics, even when the latent variables exhibit strong statistical dependencies. Crucially, these identifiability guarantees hold in the presence of modality-specific noise and across structurally divergent generative mechanisms. We empirically validate our theoretical findings on synthetic simulations and on real-world paired image-text datasets. The results underscore the robustness and applicability of contrastive learning under complex multimodal generative models. Overall, our work offers a principled explanation for the success of contrastive paradigms in multimodal scenarios and deepens the theoretical foundation underpinning modern multimodal learning.
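As a minimal, hypothetical sketch of the setting the abstract describes (not the paper's released code), the following simulates two modalities that share a common latent block, mixes each through its own nonlinear function, trains encoders with a CLIP-style symmetric InfoNCE loss, and then probes whether the shared block is recoverable from the learned representation. All dimensions, architectures, and the linear-probe R^2 criterion are illustrative assumptions, chosen as a crude stand-in for the block-identifiability evaluations used in this literature.

    # Hypothetical sketch: contrastive block-identification of shared latents
    # in a heterogeneous two-modality generative model.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    D_SHARED, D_SPEC, D_OBS, N = 4, 2, 16, 4096

    def random_mlp(d_in, d_out):
        # Small nonlinear map; stands in for a modality's mixing function.
        return nn.Sequential(nn.Linear(d_in, 32), nn.Tanh(), nn.Linear(32, d_out))

    with torch.no_grad():
        f1 = random_mlp(D_SHARED + D_SPEC, D_OBS)       # modality-1 generator
        f2 = random_mlp(D_SHARED + D_SPEC, D_OBS)       # modality-2 generator
        z_shared = torch.randn(N, D_SHARED)             # semantics common to both modalities
        m1 = torch.randn(N, D_SPEC)                     # modality-specific latents
        m2 = torch.randn(N, D_SPEC)
        x1 = f1(torch.cat([z_shared, m1], dim=1))       # "image" observations
        x2 = f2(torch.cat([z_shared, m2], dim=1))       # "text"  observations

    enc1, enc2 = random_mlp(D_OBS, D_SHARED), random_mlp(D_OBS, D_SHARED)
    opt = torch.optim.Adam(list(enc1.parameters()) + list(enc2.parameters()), lr=1e-3)

    def info_nce(h1, h2, tau=0.1):
        # Symmetric InfoNCE over in-batch negatives (CLIP-style pairing loss).
        h1 = nn.functional.normalize(h1, dim=1)
        h2 = nn.functional.normalize(h2, dim=1)
        logits = h1 @ h2.t() / tau
        labels = torch.arange(h1.size(0))
        return 0.5 * (nn.functional.cross_entropy(logits, labels)
                      + nn.functional.cross_entropy(logits.t(), labels))

    for step in range(2000):
        idx = torch.randint(0, N, (256,))
        loss = info_nce(enc1(x1[idx]), enc2(x2[idx]))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Block-identifiability check: regress the ground-truth shared latents on the
    # learned representation; a high R^2 suggests the shared block was recovered
    # up to an (here, linear) transformation.
    with torch.no_grad():
        h = enc1(x1)
    h = torch.cat([h, torch.ones(N, 1)], dim=1)         # add a bias column
    w = torch.linalg.lstsq(h, z_shared).solution
    r2 = 1 - ((h @ w - z_shared).var() / z_shared.var())
    print(f"linear-probe R^2 to shared latent: {r2:.3f}")

Note that the contrastive objective only ever sees paired observations (x1, x2); the modality-specific latents m1 and m2 act as nuisance variables, which is the regime where the abstract's block-identifiability claim applies.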