Emergent Semantics from Disjoint Modalities: Unsupervised Cross-Domain Vision-Language Grounding
Abstract
The rapid evolution of multimodal representation learning has yielded increasingly powerful vision-and-language (V\&L) systems, achieving remarkable success across diverse downstream tasks. Yet most current solutions are fundamentally constrained by their dependence on large-scale parallel corpora in which each image is paired with a manually curated caption. Constructing such datasets is not only resource-intensive but also impractical in domain-specific or low-resource scenarios. In this study, we present \textbf{VISTRA} (\textbf{VIS}ion-Text Representation Alignment), a new paradigm for V\&L pre-training that circumvents the need for explicitly aligned data. Drawing inspiration from research on unsupervised machine translation and multilingual embedding learning, VISTRA combines a dual-modality masked reconstruction mechanism with semantic anchors extracted from object detection pipelines. These anchors function as modality-agnostic pivots, enabling implicit cross-modal grounding even in the absence of direct correspondence. Experiments on four widely adopted English benchmarks show that VISTRA consistently matches, and in certain cases surpasses, supervised counterparts trained with aligned image-caption pairs. Beyond its empirical competitiveness, our approach exposes the latent geometric structure of multimodal spaces, showing that, with the aid of semantic anchoring, disjoint corpora can support effective representation alignment. This work therefore not only reduces reliance on costly annotation but also demonstrates the feasibility of building scalable and transferable V\&L models from unpaired multimodal resources.
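To make the two ingredients described above concrete, the following is a minimal PyTorch-style sketch of one pre-training step, written under our own assumptions rather than taken from a released implementation: each modality is reconstructed from masked inputs on its own unpaired corpus, while detector-derived object labels supply shared anchor embeddings toward which both pooled representations are pulled. All module and variable names (VistraSketch, anchor_emb, region_proj, and so on) are illustrative, not part of the paper.

# Minimal sketch (hypothetical, not the paper's released code): one training step
# combining intra-modal masked reconstruction with an anchor-alignment term, where
# object labels detected in images or mentioned in text act as shared pivots
# between otherwise unpaired image and text batches.
import torch
import torch.nn.functional as F
from torch import nn

class VistraSketch(nn.Module):
    def __init__(self, vocab_size=30522, anchor_vocab=1600, dim=256):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.image_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.region_proj = nn.Linear(2048, dim)       # detector region features -> shared dim
        self.anchor_emb = nn.Embedding(anchor_vocab, dim)  # object-label ("anchor") embeddings
        self.text_head = nn.Linear(dim, vocab_size)   # masked-token reconstruction
        self.region_head = nn.Linear(dim, 2048)       # masked-region feature regression

    def forward(self, text_ids, text_mask_pos, region_feats, region_mask_pos,
                text_anchor_ids, image_anchor_ids):
        # Masked text reconstruction on the text-only corpus.
        t_enc = self.text_encoder(self.tok_emb(text_ids))
        loss_text = F.cross_entropy(self.text_head(t_enc[text_mask_pos]),
                                    text_ids[text_mask_pos])

        # Masked region reconstruction on the image-only corpus.
        v_enc = self.image_encoder(self.region_proj(region_feats))
        loss_image = F.mse_loss(self.region_head(v_enc[region_mask_pos]),
                                region_feats[region_mask_pos])

        # Anchor alignment: pull pooled text/image representations toward the
        # embeddings of the object labels associated with each sample.
        t_pool, v_pool = t_enc.mean(dim=1), v_enc.mean(dim=1)
        a_text = self.anchor_emb(text_anchor_ids).mean(dim=1)
        a_image = self.anchor_emb(image_anchor_ids).mean(dim=1)
        loss_anchor = (1 - F.cosine_similarity(t_pool, a_text)).mean() + \
                      (1 - F.cosine_similarity(v_pool, a_image)).mean()

        return loss_text + loss_image + loss_anchor

In this sketch the anchor term is what couples the otherwise independent encoders: because the same label embeddings are shared across modalities, images and sentences associated with the same objects are drawn toward a common region of the representation space, even though no image-caption pair is ever observed.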