Fusion of Visual and Textual Data for Enhanced Semantic Representations
Abstract
Generic text embeddings have demonstrated considerable success across a wide range of applications. However, these embeddings are typically learned from co-occurrence patterns in text-only corpora, which limits how well they generalize across diverse contexts. In this study, we investigate methods for incorporating visual information into textual representations to overcome this limitation. Guided by extensive ablation studies, we introduce a simple yet effective architecture, the VisualText Fusion Network (VTFN). VTFN not only surpasses existing multimodal approaches on a range of well-established benchmarks but also achieves state-of-the-art performance on image-related textual datasets while using significantly less training data. Our findings underscore the potential of integrating visual modalities to substantially improve the robustness and applicability of text embeddings, paving the way for more nuanced and contextually rich semantic representations.
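The abstract does not specify how VTFN combines the two modalities, but the general idea of enriching a text embedding with paired visual features can be illustrated with a minimal, hypothetical gated-fusion sketch. The class name, dimensions, and layer choices below are assumptions for illustration only, not the paper's actual architecture:

```python
import torch
import torch.nn as nn


class GatedVisualTextFusion(nn.Module):
    """Illustrative gated fusion of a text embedding and an image embedding.

    NOTE: this is a generic sketch of visual-textual fusion, not the paper's
    VTFN; all dimensions and layers are hypothetical.
    """

    def __init__(self, text_dim: int = 768, image_dim: int = 2048, fused_dim: int = 768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)    # project textual features
        self.image_proj = nn.Linear(image_dim, fused_dim)  # project visual features
        self.gate = nn.Linear(2 * fused_dim, fused_dim)    # learn how much visual signal to admit

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        t = self.text_proj(text_emb)
        v = self.image_proj(image_emb)
        g = torch.sigmoid(self.gate(torch.cat([t, v], dim=-1)))
        # Gated residual: keep the textual embedding as the backbone and
        # mix in visual information only where the gate allows it.
        return t + g * v


if __name__ == "__main__":
    fusion = GatedVisualTextFusion()
    text_emb = torch.randn(4, 768)    # e.g. sentence embeddings from a text encoder
    image_emb = torch.randn(4, 2048)  # e.g. pooled features for paired images
    fused = fusion(text_emb, image_emb)
    print(fused.shape)  # torch.Size([4, 768])
```

The gating keeps the textual representation as the backbone and lets the model decide, per dimension, how much visual evidence to incorporate, which is one common way to make the fused embedding robust when no informative image is available.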