Spatial Intelligence in Vision-Language Models: A Comprehensive Survey

Abstract

Vision-language models (VLMs) have achieved impressive progress, yet they still struggle with spatial intelligence: understanding where objects are, how they relate to one another, and how space changes across viewpoints. This limitation matters for embodied AI, autonomous driving, and spatially consistent generation. Meanwhile, rapid advances in spatially enhanced VLMs have produced a scattered literature with inconsistent terminology, methods, and evaluation practices. In this survey, we provide the first unified overview of the field. We summarize the core concepts behind spatial reasoning in VLMs, analyze why spatial failures occur, and organize existing solutions into a clear framework spanning prompting-based techniques, model improvements, explicit 2D cues, 3D enrichment, and data-driven strategies. We also examine how spatial ability is currently measured and report an empirical study across 37 models and 9 representative benchmarks. Our analysis highlights the current best-performing approaches, clarifies when different strategies help or fail, and shows that many widely used benchmarks do not reliably capture true spatial understanding. By consolidating evidence and outlining open challenges, this survey offers a practical roadmap for building more spatially capable VLMs. We release our evaluation code and maintain a curated paper repository to support the rapidly growing research on spatial intelligence in vision-language models.