Enhancing Spatial Reasoning in Large Vision-Language Models

Abstract

In this paper, we present a novel approach, Spatial-Aware Language Enhancement (SALE), designed to improve the spatial reasoning capabilities of large language models (LLMs) using structured textual descriptions, without the need for visual inputs. Spatial reasoning is critical for understanding object relationships, navigation, and scene layouts in applications such as robotics and autonomous systems. Despite significant advancements in large vision-language models (LVLMs), existing models still struggle with spatial tasks, especially when relying solely on textual data. To address this, we propose a two-phase training strategy involving Spatial Description Pre-training (SDP) and fine-tuning on an enhanced benchmark dataset, SpatialEval+. Our experimental results demonstrate that SALE achieves state-of-the-art performance across various spatial tasks, outperforming existing models like GPT-4 and GPT-4+BLIP, while also being more resource-efficient. Further analysis, including ablation studies and human evaluation, confirms the effectiveness of our approach, indicating its potential for real-world applications where efficient and accurate spatial understanding is required.
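To make the two-phase schedule concrete, the sketch below shows one minimal way such a pipeline could be wired up in PyTorch. It is illustrative only: the abstract does not specify SALE's base model, data formats, or hyperparameters, so the tiny model, the placeholder corpora, and all settings here are assumptions rather than the authors' implementation.

```python
# Illustrative sketch only: mirrors the two-phase schedule described in the
# abstract (Spatial Description Pre-training, then SpatialEval+ fine-tuning).
# The toy model, random placeholder data, and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

VOCAB, DIM = 1000, 64

class ToyLM(nn.Module):
    """Minimal next-token model standing in for the underlying LLM."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)

def run_phase(model, loader, epochs, lr):
    """One training phase: standard next-token cross-entropy."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for (tokens,) in loader:
            logits = model(tokens[:, :-1])
            loss = loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()

model = ToyLM()
# Placeholder corpora: random token sequences stand in for structured spatial
# descriptions (phase 1) and SpatialEval+-style task examples (phase 2).
sdp_corpus = TensorDataset(torch.randint(0, VOCAB, (256, 32)))
eval_plus = TensorDataset(torch.randint(0, VOCAB, (64, 32)))

# Phase 1: Spatial Description Pre-training (SDP).
run_phase(model, DataLoader(sdp_corpus, batch_size=16), epochs=2, lr=1e-3)
# Phase 2: fine-tuning on the enhanced benchmark dataset (SpatialEval+).
run_phase(model, DataLoader(eval_plus, batch_size=16), epochs=1, lr=1e-4)
```

The key design point carried over from the abstract is the ordering: a broad pre-training pass over spatial descriptions before a smaller, lower-learning-rate fine-tuning pass on the benchmark-style data; everything else is a placeholder.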
