Enhancing Spatial Cognition in MLLMs with Depth Maps and Point Cloud Data
Abstract
Contemporary multimodal large language models (MLLMs), particularly those integrating visual and textual modalities, have demonstrated remarkable capabilities in both image comprehension and text generation. However, current multimodal learning paradigms focus predominantly on RGB images and plain text, and often exhibit limitations in spatial cognition. In this study, we enhance the spatial understanding of MLLMs by preprocessing raw images to extract spatial information such as depth maps and point cloud data, and subsequently incorporating these into the learning process. Additionally, we employ instruction tuning with comprehensive, detailed textual descriptions to enrich the model's spatial awareness. Our experiments reveal that models trained with these enhancements surpass baseline models on tasks such as image captioning and visual question answering (VQA). Although traditional metrics such as CIDEr and ROUGE-L show improvement, they fail to capture the model's enhanced spatial reasoning abilities, necessitating complementary evaluation methods such as Ref_LongCLIPScore. Empirically, we observed statistically significant improvements: a 1.5% absolute increase in Ref_LongCLIPScore (p < 0.05) and a 1.2% boost in the average accuracy on VQA tasks (p < 0.05). These gains underscore the model's superior performance in describing spatial relationships within images. Our model weights are publicly available at huggingface.co/fisheries/wcsllava/tree/main.
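The abstract's preprocessing step, deriving a point cloud from a depth map, can be illustrated with a minimal sketch. This is not the paper's pipeline, only a standard back-projection under an assumed pinhole camera model (intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical); in practice the depth map would come from a monocular depth estimator rather than the synthetic array used here.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map of shape (H, W) into an (N, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    # Pixel coordinate grids: u varies along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop pixels with no valid depth (Z <= 0).
    return points[points[:, 2] > 0]

# Synthetic example: a flat surface 2 m in front of the camera.
depth = np.full((4, 4), 2.0)
cloud = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(cloud.shape)  # (16, 3)
```

The resulting (N, 3) array is the kind of explicit geometric signal that, per the abstract, is fed into training alongside the RGB image to ground spatial relations such as relative distance and occlusion.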