Orchestrating Visual and Linguistic Modalities for Robust Spatial Intelligence in LVLMs
Abstract
The ability of large vision-language models (LVLMs) to understand and reason about complex spatial relationships within visual scenes is critical for advancing artificial intelligence, particularly in domains such as robotics and augmented reality. Despite their impressive general capabilities, current LVLMs often struggle with fine-grained spatial grounding: they have difficulty precisely describing relative object positions, sizes, and distances, and performing multi-step spatial reasoning. This paper introduces Multi-Granularity Spatial-Relational Graph Transformer (MGS-RGT) Training, a two-stage learning paradigm designed to strengthen LVLMs' spatial intelligence. The method first applies Hierarchical Spatial Graph Prediction (HSGP) pre-training, which trains the visual encoder to represent multi-scale spatial relationships (fine-grained, object-level, and scene-level) through explicit graph learning. The full LVLM then undergoes Spatially-Grounded Language Generation (SGLG) fine-tuning with Chain-of-Thought (CoT) integration, which guides the model to articulate its spatial reasoning process. Comprehensive experiments on standard VQA benchmarks and our new Spatially-Grounded Interaction Dataset (SGID) show that MGS-RGT consistently outperforms state-of-the-art baselines on Spatial VQA Accuracy, Relationship Prediction F1-Score, and Task Success Rate for embodied tasks. Ablation studies confirm the contributions of both HSGP pre-training and CoT integration, and qualitative analysis and human evaluations further show that MGS-RGT generates accurate, detailed, and coherent spatial descriptions.
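As a rough illustration of the two-stage recipe summarized above, the minimal sketch below pairs an HSGP-style pre-training step (supervising a visual encoder with pairwise spatial-relation labels) with an SGLG-style fine-tuning step (next-token prediction on CoT-formatted spatial answers). All module names, shapes, and losses here (VisualEncoderWithRelationHead, ToyLVLM, hsgp_step, sglg_step) are hypothetical placeholders chosen for the sketch, not the paper's implementation of MGS-RGT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualEncoderWithRelationHead(nn.Module):
    """Toy visual encoder plus a head that scores pairwise spatial relations."""

    def __init__(self, in_dim=1024, dim=256, num_relations=8):
        super().__init__()
        self.backbone = nn.Linear(in_dim, dim)                # stand-in for a ViT backbone
        self.rel_head = nn.Bilinear(dim, dim, num_relations)  # relation-type scores per token pair

    def forward(self, patch_feats):                           # patch_feats: (B, N, in_dim)
        tok = self.backbone(patch_feats)                      # (B, N, dim)
        B, N, D = tok.shape
        left = tok.unsqueeze(2).expand(B, N, N, D).contiguous()
        right = tok.unsqueeze(1).expand(B, N, N, D).contiguous()
        return tok, self.rel_head(left, right)                # tokens, relation logits (B, N, N, R)


class ToyLVLM(nn.Module):
    """Toy LVLM: visual tokens are prepended to text tokens and decoded causally."""

    def __init__(self, encoder, vocab=1000, dim=256):
        super().__init__()
        self.encoder = encoder
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, patch_feats, token_ids):
        vis, _ = self.encoder(patch_feats)
        x = torch.cat([vis, self.embed(token_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        out = self.decoder(x, mask=mask)
        return self.lm_head(out)[:, vis.size(1):]             # logits for text positions only


def hsgp_step(encoder, opt, patch_feats, relation_labels):
    """Stage 1 (HSGP-style): supervise the encoder with pairwise spatial-relation labels."""
    _, logits = encoder(patch_feats)
    loss = F.cross_entropy(logits.flatten(0, 2), relation_labels.flatten())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


def sglg_step(lvlm, opt, patch_feats, cot_token_ids):
    """Stage 2 (SGLG-style): next-token prediction on CoT-formatted spatial answers."""
    logits = lvlm(patch_feats, cot_token_ids[:, :-1])
    loss = F.cross_entropy(logits.flatten(0, 1), cot_token_ids[:, 1:].flatten())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Synthetic usage: random features and labels stand in for real spatial-graph
# annotations and CoT answer tokens.
enc = VisualEncoderWithRelationHead()
feats = torch.randn(2, 16, 1024)                              # fake patch features
rels = torch.randint(0, 8, (2, 16, 16))                       # fake pairwise relation labels
print("HSGP loss:", hsgp_step(enc, torch.optim.AdamW(enc.parameters(), lr=1e-4), feats, rels))

lvlm = ToyLVLM(enc)
cot = torch.randint(0, 1000, (2, 12))                         # fake CoT answer token ids
print("SGLG loss:", sglg_step(lvlm, torch.optim.AdamW(lvlm.parameters(), lr=1e-5), feats, cot))
```

The two stages are deliberately decoupled: the encoder's relation head is only used for the graph-supervision loss, while the language head drives the grounded-generation loss, mirroring the pre-train-then-fine-tune ordering the abstract describes.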