Enhancing Spatial Cognition in MLLMs with Depth Maps and Point Cloud Data
Abstract
Contemporary multimodal large language models (MLLMs), particularly those integrating visual and textual modalities, have demonstrated remarkable capabilities in both image comprehension and text generation. However, current multimodal learning paradigms focus predominantly on RGB images and plain text, and often exhibit limitations in spatial cognition. In this study, we enhance the spatial understanding of MLLMs by preprocessing raw images to extract spatial information such as depth maps and point cloud data, and subsequently incorporating these into the learning process. Additionally, we employ instruction tuning with comprehensive, detailed textual descriptions to enrich the model's spatial awareness. Our experiments reveal that models trained with these enhancements surpass baseline models on tasks such as image captioning and visual question answering (VQA). Although traditional metrics such as CIDEr and ROUGE-L show improvement, they fail to capture the model's enhanced spatial reasoning abilities, necessitating complementary evaluation methods such as Ref_LongCLIPScore. Empirically, we observed statistically significant improvements: a 1.5% absolute increase in Ref_LongCLIPScore (p < 0.05) and a 1.2% boost in the average accuracy on VQA tasks (p < 0.05). These gains underscore the model's superior performance in describing spatial relationships within images. Our model weights are publicly available at huggingface.co/fisheries/wcsllava/tree/main.
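The abstract's preprocessing step, deriving a point cloud from a depth map, can be illustrated with a minimal sketch. This is not the paper's pipeline, only a standard back-projection under an assumed pinhole camera model (intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical); in practice the depth map would come from a monocular depth estimator rather than the synthetic array used here.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map of shape (H, W) into an (N, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    # Pixel coordinate grids: u varies along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop pixels with no valid depth (Z <= 0).
    return points[points[:, 2] > 0]

# Synthetic example: a flat surface 2 m in front of the camera.
depth = np.full((4, 4), 2.0)
cloud = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(cloud.shape)  # (16, 3)
```

The resulting (N, 3) array is the kind of explicit geometric signal that, per the abstract, is fed into training alongside the RGB image to ground spatial relations such as relative distance and occlusion.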