Decoupled Yet Aligned Transformer for Semantic Image-Text Retrieval
Abstract
Retrieving semantically related content across visual and textual modalities remains a central challenge in multimodal artificial intelligence. Despite rapid progress in cross-modal understanding, many existing systems still struggle to balance modality-specific representation fidelity with scalability in retrieval scenarios. In this paper, we present \textbf{DUET} (Dual-Stream Encoder for Unified Embedding and Translation), a transformer-based architecture that explicitly separates the encoding pipelines of the visual and textual modalities in its early layers, yet strategically enforces alignment through shared parameters in its deeper layers. This modular approach allows DUET to retain modality-specific semantics while constructing a unified latent space suitable for fast and accurate retrieval. Unlike prior architectures that rely on entangled cross-modal attention, DUET’s design enables precomputed indexing and supports efficient large-scale matching. Additionally, we propose a new evaluation protocol grounded in semantic similarity: by leveraging caption-level soft relevance, it extends beyond traditional binary Recall@K metrics. Concretely, we introduce a similarity-weighted discounted cumulative gain (DCG) scoring scheme that reflects more nuanced relevance patterns. Empirical results on the MS-COCO benchmark demonstrate that DUET consistently outperforms existing methods on both hard and soft retrieval metrics, establishing a new state of the art in the weakly supervised setting. Code and pre-trained models will be made publicly available upon publication.
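To make the decoupled-then-shared design concrete, the following minimal sketch shows one way such an architecture could be organized, assuming a PyTorch-style implementation; the layer counts, dimensions, mean pooling, and module names here are illustrative assumptions, not the configuration actually used by DUET.

```python
# Illustrative sketch only: layer counts, dimensions, pooling, and module
# names are assumptions, not the configuration reported for DUET.
import torch.nn as nn
import torch.nn.functional as F


class DualStreamEncoder(nn.Module):
    """Modality-specific early layers followed by shared deeper layers."""

    def __init__(self, dim=512, n_heads=8, n_private=4, n_shared=4):
        super().__init__()

        def make_layer():
            return nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)

        # Decoupled early layers: each modality keeps its own parameters.
        self.image_private = nn.TransformerEncoder(make_layer(), num_layers=n_private)
        self.text_private = nn.TransformerEncoder(make_layer(), num_layers=n_private)
        # Shared deeper layers: the same weights process both modalities,
        # pushing them toward a unified latent space.
        self.shared = nn.TransformerEncoder(make_layer(), num_layers=n_shared)

    def encode_image(self, image_tokens):
        # image_tokens: (batch, num_patches, dim) patch or region features
        pooled = self.shared(self.image_private(image_tokens)).mean(dim=1)
        return F.normalize(pooled, dim=-1)

    def encode_text(self, text_tokens):
        # text_tokens: (batch, seq_len, dim) token embeddings
        pooled = self.shared(self.text_private(text_tokens)).mean(dim=1)
        return F.normalize(pooled, dim=-1)
```

Because the two branches interact only through shared weights rather than cross-attention between modalities, image embeddings can be encoded once and indexed offline, which is the property the abstract credits for efficient large-scale matching.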
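Similarly, the similarity-weighted DCG evaluation can be illustrated by plugging caption-level soft relevance scores into the standard DCG formula; how those relevance scores are derived, and whether the score is normalized, are assumptions of this sketch rather than details given in the abstract.

```python
# Minimal sketch of a similarity-weighted DCG; relevance scores are assumed
# to come from caption-level soft similarity, and the normalized variant is
# a common convention rather than something the abstract specifies.
import math


def soft_dcg(relevances, k=None):
    """DCG over graded (soft) relevance scores, given in ranked order."""
    if k is not None:
        relevances = relevances[:k]
    # Standard DCG: rank 1 contributes rel / log2(2), rank 2 contributes rel / log2(3), ...
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def soft_ndcg(relevances, k=None):
    """DCG normalized by the ideal (descending-sorted) ranking."""
    ideal = soft_dcg(sorted(relevances, reverse=True), k)
    return soft_dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Unlike binary Recall@K, which only checks whether the single annotated match appears in the top K, this kind of graded score also rewards retrieving items that are semantically close to the query without being the exact ground-truth pair.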