Emergent Semantics from Disjoint Modalities: Unsupervised Cross-Domain Vision-Language Grounding
Abstract
The rapid evolution of multimodal representation learning has yielded increasingly powerful vision-and-language (V\&L) systems, achieving remarkable success across diverse downstream tasks. Yet most current solutions are fundamentally constrained by their dependence on large-scale parallel corpora in which each image is paired with a manually curated caption. Constructing such datasets is not only resource-intensive but also impractical in domain-specific or low-resource scenarios. In this study, we present \textbf{VISTRA} (\textbf{VIS}ion-Text Representation Alignment), a new paradigm for V\&L pre-training that circumvents the need for explicitly aligned data. Drawing inspiration from research on unsupervised machine translation and multilingual embedding learning, VISTRA combines a dual-modality masked reconstruction mechanism with semantic anchors extracted from object detection pipelines. These anchors function as modality-agnostic pivots, enabling implicit cross-modal grounding even in the absence of direct correspondence. Experiments on four widely adopted English benchmarks show that VISTRA consistently matches, and in certain cases surpasses, supervised counterparts trained with aligned image-caption pairs. Beyond its empirical competitiveness, our approach exposes the latent geometric structure of multimodal spaces, showing that, with the aid of semantic anchoring, disjoint corpora can support effective representation alignment. This work therefore not only reduces reliance on costly annotation but also demonstrates the feasibility of building scalable and transferable V\&L models from unpaired multimodal resources.
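To make the two ingredients described above concrete, the following is a minimal PyTorch-style sketch of one pre-training step, written under our own assumptions rather than taken from a released implementation: each modality is reconstructed from masked inputs on its own unpaired corpus, while detector-derived object labels supply shared anchor embeddings toward which both pooled representations are pulled. All module and variable names (VistraSketch, anchor_emb, region_proj, and so on) are illustrative, not part of the paper.

# Minimal sketch (hypothetical, not the paper's released code): one training step
# combining intra-modal masked reconstruction with an anchor-alignment term, where
# object labels detected in images or mentioned in text act as shared pivots
# between otherwise unpaired image and text batches.
import torch
import torch.nn.functional as F
from torch import nn

class VistraSketch(nn.Module):
    def __init__(self, vocab_size=30522, anchor_vocab=1600, dim=256):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.image_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.region_proj = nn.Linear(2048, dim)       # detector region features -> shared dim
        self.anchor_emb = nn.Embedding(anchor_vocab, dim)  # object-label ("anchor") embeddings
        self.text_head = nn.Linear(dim, vocab_size)   # masked-token reconstruction
        self.region_head = nn.Linear(dim, 2048)       # masked-region feature regression

    def forward(self, text_ids, text_mask_pos, region_feats, region_mask_pos,
                text_anchor_ids, image_anchor_ids):
        # Masked text reconstruction on the text-only corpus.
        t_enc = self.text_encoder(self.tok_emb(text_ids))
        loss_text = F.cross_entropy(self.text_head(t_enc[text_mask_pos]),
                                    text_ids[text_mask_pos])

        # Masked region reconstruction on the image-only corpus.
        v_enc = self.image_encoder(self.region_proj(region_feats))
        loss_image = F.mse_loss(self.region_head(v_enc[region_mask_pos]),
                                region_feats[region_mask_pos])

        # Anchor alignment: pull pooled text/image representations toward the
        # embeddings of the object labels associated with each sample.
        t_pool, v_pool = t_enc.mean(dim=1), v_enc.mean(dim=1)
        a_text = self.anchor_emb(text_anchor_ids).mean(dim=1)
        a_image = self.anchor_emb(image_anchor_ids).mean(dim=1)
        loss_anchor = (1 - F.cosine_similarity(t_pool, a_text)).mean() + \
                      (1 - F.cosine_similarity(v_pool, a_image)).mean()

        return loss_text + loss_image + loss_anchor

In this sketch the anchor term is what couples the otherwise independent encoders: because the same label embeddings are shared across modalities, images and sentences associated with the same objects are drawn toward a common region of the representation space, even though no image-caption pair is ever observed.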