Orchestrating Visual and Linguistic Modalities for Robust Spatial Intelligence in LVLMs
Abstract
The ability of large vision-language models (LVLMs) to understand and reason about complex spatial relationships within visual scenes is critical for advancing artificial intelligence, particularly in domains such as robotics and augmented reality. Despite their impressive general capabilities, current LVLMs often struggle with fine-grained spatial grounding: they have difficulty precisely describing relative object positions, sizes, and distances, and performing multi-step spatial reasoning. This paper introduces Multi-Granularity Spatial-Relational Graph Transformer (MGS-RGT) Training, a two-stage learning paradigm designed to strengthen LVLMs' spatial intelligence. The method first applies Hierarchical Spatial Graph Prediction (HSGP) pre-training, which trains the visual encoder to represent multi-scale spatial relationships (fine-grained, object-level, and scene-level) through explicit graph learning. The full LVLM then undergoes Spatially-Grounded Language Generation (SGLG) fine-tuning with Chain-of-Thought (CoT) integration, which guides the model to articulate its spatial reasoning process. Comprehensive experiments on standard VQA benchmarks and our new Spatially-Grounded Interaction Dataset (SGID) show that MGS-RGT consistently outperforms state-of-the-art baselines on Spatial VQA Accuracy, Relationship Prediction F1-Score, and Task Success Rate for embodied tasks. Ablation studies confirm the contributions of both HSGP pre-training and CoT integration, and qualitative analysis and human evaluations further show that MGS-RGT generates accurate, detailed, and coherent spatial descriptions.
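As a rough illustration of the two-stage recipe summarized above, the minimal sketch below pairs an HSGP-style pre-training step (supervising a visual encoder with pairwise spatial-relation labels) with an SGLG-style fine-tuning step (next-token prediction on CoT-formatted spatial answers). All module names, shapes, and losses here (VisualEncoderWithRelationHead, ToyLVLM, hsgp_step, sglg_step) are hypothetical placeholders chosen for the sketch, not the paper's implementation of MGS-RGT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualEncoderWithRelationHead(nn.Module):
    """Toy visual encoder plus a head that scores pairwise spatial relations."""

    def __init__(self, in_dim=1024, dim=256, num_relations=8):
        super().__init__()
        self.backbone = nn.Linear(in_dim, dim)                # stand-in for a ViT backbone
        self.rel_head = nn.Bilinear(dim, dim, num_relations)  # relation-type scores per token pair

    def forward(self, patch_feats):                           # patch_feats: (B, N, in_dim)
        tok = self.backbone(patch_feats)                      # (B, N, dim)
        B, N, D = tok.shape
        left = tok.unsqueeze(2).expand(B, N, N, D).contiguous()
        right = tok.unsqueeze(1).expand(B, N, N, D).contiguous()
        return tok, self.rel_head(left, right)                # tokens, relation logits (B, N, N, R)


class ToyLVLM(nn.Module):
    """Toy LVLM: visual tokens are prepended to text tokens and decoded causally."""

    def __init__(self, encoder, vocab=1000, dim=256):
        super().__init__()
        self.encoder = encoder
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, patch_feats, token_ids):
        vis, _ = self.encoder(patch_feats)
        x = torch.cat([vis, self.embed(token_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        out = self.decoder(x, mask=mask)
        return self.lm_head(out)[:, vis.size(1):]             # logits for text positions only


def hsgp_step(encoder, opt, patch_feats, relation_labels):
    """Stage 1 (HSGP-style): supervise the encoder with pairwise spatial-relation labels."""
    _, logits = encoder(patch_feats)
    loss = F.cross_entropy(logits.flatten(0, 2), relation_labels.flatten())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


def sglg_step(lvlm, opt, patch_feats, cot_token_ids):
    """Stage 2 (SGLG-style): next-token prediction on CoT-formatted spatial answers."""
    logits = lvlm(patch_feats, cot_token_ids[:, :-1])
    loss = F.cross_entropy(logits.flatten(0, 1), cot_token_ids[:, 1:].flatten())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Synthetic usage: random features and labels stand in for real spatial-graph
# annotations and CoT answer tokens.
enc = VisualEncoderWithRelationHead()
feats = torch.randn(2, 16, 1024)                              # fake patch features
rels = torch.randint(0, 8, (2, 16, 16))                       # fake pairwise relation labels
print("HSGP loss:", hsgp_step(enc, torch.optim.AdamW(enc.parameters(), lr=1e-4), feats, rels))

lvlm = ToyLVLM(enc)
cot = torch.randint(0, 1000, (2, 12))                         # fake CoT answer token ids
print("SGLG loss:", sglg_step(lvlm, torch.optim.AdamW(lvlm.parameters(), lr=1e-5), feats, cot))
```

The two stages are deliberately decoupled: the encoder's relation head is only used for the graph-supervision loss, while the language head drives the grounded-generation loss, mirroring the pre-train-then-fine-tune ordering the abstract describes.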