Spatial Intelligence in Vision-Language Models: A Comprehensive Survey

Abstract

Vision-language models (VLMs) have achieved impressive progress, yet they still struggle with spatial intelligence: understanding where objects are, how they relate to one another, and how space changes across viewpoints. This limitation matters for embodied AI, autonomous driving, and spatially consistent generation. Meanwhile, rapid advances in spatially enhanced VLMs have produced a scattered literature with inconsistent terminology, methods, and evaluation practices. In this survey, we provide the first unified overview of the field. We summarize the core concepts behind spatial reasoning in VLMs, analyze why spatial failures occur, and organize existing solutions into a clear framework spanning prompting-based techniques, model improvements, explicit 2D cues, 3D enrichment, and data-driven strategies. We also examine how spatial ability is currently measured and report an empirical study across 37 models and 9 representative benchmarks. Our analysis highlights the current best-performing approaches, clarifies when different strategies help or fail, and shows that many widely used benchmarks do not reliably capture true spatial understanding. By consolidating evidence and outlining open challenges, this survey offers a practical roadmap for building more spatially capable VLMs. We release our evaluation code and maintain a curated paper repository to support the rapidly growing research on spatial intelligence in vision-language models.