Are Vision-Language Models Truly Understanding Multi-vision Sensor?

Sangyun Chung
Youngjoon Yu
Youngchae Chee
Se Yeon Kim
Byung-Kwan Lee
Yong Man Ro

Abstract

Large-scale Vision-Language Models (VLMs) have advanced by aligning vision inputs with text, significantly improving performance on computer vision tasks. For VLMs to be used effectively in real-world applications, they must also understand diverse multi-vision sensor data, such as thermal, depth, and X-ray information. However, we find that current VLMs process multi-vision sensor images without a deep understanding of sensor information, disregarding each sensor’s unique physical properties. This limitation restricts their capacity to interpret and respond to complex questions that require multi-vision sensor reasoning. To address this, we propose a novel Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark that assesses VLMs on their capacity for sensor-specific reasoning. We also introduce Diverse Negative Attributes (DNA) optimization to enable VLMs to perform deep reasoning on multi-vision sensor tasks, helping to bridge the core information gap between images and sensor data. Extensive experimental results show that the proposed DNA method significantly improves the multi-vision sensor reasoning of VLMs. Code and data are available at https://github.com/top-yun/MS-PR

Version published to 10.32388/vn42c2
Jan 30, 2025

