Bridging Vision and Texts: An External Graph Framework for Enhanced Language Comprehension

Abstract

In this work, we introduce a novel framework that augments language understanding systems with external multimodal graph structures. Instead of increasing the internal capacity of language models by scaling parameters, our approach leverages a dedicated external repository, an enriched knowledge graph, to provide additional visual and textual cues during inference. Specifically, given multilingual inputs (for example, German sentences), our method retrieves corresponding entities from the graph and incorporates their multimodal embeddings to boost performance on various downstream tasks. Our framework, herein referred to as AlphaKG, integrates state-of-the-art tuple-based and graph-based learning strategies to generate representations for entities and their inter-relations. By fusing data from diverse modalities, including textual descriptions available in 14 languages and multiple visual samples per entity, we design a robust representation learning scheme that is predictive of the underlying graph structure. Experiments on multilingual named entity recognition (NER) and crosslingual visual verb sense disambiguation (VSD) show promising results, with improvements reaching up to 0.7% in F1 score for NER and up to 2.5% in accuracy for VSD. Additionally, we derive new equations to refine the integration between the retrieved external features and the language model inputs, thereby offering a comprehensive solution that enhances parameter efficiency while maintaining competitive performance.
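The retrieve-and-fuse idea described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy graph, the entity names, the pooling of textual and visual vectors by averaging, and the gated-sum fusion rule are stand-ins, not the paper's actual equations or data.

```python
# Illustrative sketch only: the toy graph, embeddings, and the gated-sum
# fusion rule are assumptions, not the framework's exact method.

def l2_normalize(v):
    norm = sum(x * x for x in v) ** 0.5 or 1.0
    return [x / norm for x in v]

# Toy external graph: each entity carries a text embedding (e.g. pooled
# over multilingual descriptions) and a visual embedding (pooled over
# several images of the entity).
KG = {
    "Berlin": {"text": [0.9, 0.1, 0.0], "visual": [0.7, 0.3, 0.1]},
    "Goethe": {"text": [0.1, 0.8, 0.2], "visual": [0.0, 0.9, 0.4]},
}

def entity_embedding(name):
    """Average the normalized textual and visual vectors of one entity."""
    e = KG[name]
    t, v = l2_normalize(e["text"]), l2_normalize(e["visual"])
    return [(a + b) / 2.0 for a, b in zip(t, v)]

def fuse(token_vec, entity_vec, gate=0.3):
    """Gated sum: keep most of the language-model input, mix in KG signal."""
    return [(1 - gate) * a + gate * b for a, b in zip(token_vec, entity_vec)]

# Retrieval: link surface forms in a (German) input sentence to entities.
sentence = "Goethe wurde in Frankfurt geboren"
linked = [tok for tok in sentence.split() if tok in KG]

token_vec = [0.2, 0.5, 0.3]  # stand-in for a contextual token embedding
fused = fuse(token_vec, entity_embedding(linked[0]))
```

A real system would replace the dictionary lookup with learned entity linking and the fixed gate with a trained gating function, but the flow (retrieve graph entity, pool its modalities, mix into the model input) is the one the abstract describes.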
