Interleaved Multi-Modal Document Representations for Large-Scale Information Retrieval Using Large Language Models
Abstract
The exponential growth of multi-modal data poses significant challenges for traditional information retrieval systems, which often struggle to integrate and process content across formats such as text and images. This work introduces a unified interleaved representation that combines multi-modal inputs into a single token sequence, enabling more accurate and efficient retrieval of complex content. The method pairs cross-modal attention mechanisms with token interleaving, improving the model's ability to capture relationships between modalities within a shared latent space. Experimental results show substantial gains in retrieval accuracy, precision, and computational efficiency over existing baseline models. The proposed system improves the performance of large-scale retrieval systems and offers a scalable, robust framework for processing increasingly complex and diverse datasets.
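The abstract does not specify the architecture, so the following is only a minimal sketch of the interleaving idea: text tokens and image-patch tokens are merged into one sequence in document order, tagged with a modality-type embedding, and passed through a shared Transformer encoder, so that ordinary self-attention operates cross-modally. The class name `InterleavedEncoder`, the `modality_ids` input, and all dimensions are hypothetical illustrations, not the paper's implementation.

```python
# Minimal sketch, assuming PyTorch; dimensions and names are
# hypothetical and not taken from the paper.
import torch
import torch.nn as nn

class InterleavedEncoder(nn.Module):
    """Encodes an interleaved sequence of text and image tokens in a
    shared latent space; self-attention over the merged sequence acts
    as the cross-modal attention mechanism."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Modality-type embeddings mark each position as a text token
        # (id 0) or an image-patch token (id 1).
        self.type_embed = nn.Embedding(2, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens, modality_ids):
        # tokens:       (batch, seq_len, d_model) pre-embedded tokens,
        #               already interleaved in document order
        # modality_ids: (batch, seq_len) long tensor, 0 = text, 1 = image
        x = tokens + self.type_embed(modality_ids)
        h = self.encoder(x)
        # Mean-pool into a single document vector for retrieval.
        return h.mean(dim=1)

# Usage: interleave 8 text tokens and 4 image-patch tokens per document.
text = torch.randn(2, 8, 256)
image = torch.randn(2, 4, 256)
tokens = torch.cat([text[:, :4], image, text[:, 4:]], dim=1)
mods = torch.tensor([[0] * 4 + [1] * 4 + [0] * 4] * 2)
doc_vec = InterleavedEncoder()(tokens, mods)  # (2, 256) retrieval embedding
```

In a retrieval setting, the pooled vector would typically be scored against a query embedding by dot product or cosine similarity; whether the paper uses pooling or a different aggregation is not stated in the abstract.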