Visual Question Answering Based on Visual Content and Query Enhancement

Abstract

With the rapid development of computer vision and natural language processing, visual question answering (VQA), which sits at the intersection of the two fields, has become a research hotspot. Existing VQA models have made significant progress in general scenarios, but in the presence of data bias, current debiasing methods transfer poorly. Meanwhile, multimodal information enhancement techniques have recently made remarkable progress in exploring and integrating cross-modal semantics; their strong cross-modal representation capabilities offer a new way to address the problems caused by data bias. This paper therefore takes multimodal information enhancement as its starting point and designs a VQA method based on visual content and query enhancement for data-biased scenarios, aiming to improve the model's reasoning ability and overall performance in such contexts.
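To make the cross-modal representation idea above concrete, the following is a minimal, purely illustrative sketch (not the paper's actual method): it fuses a toy visual feature vector with a toy question embedding via an element-wise product, a common fusion step in VQA pipelines, and scores candidate answers against the fused vector. All names, vectors, and the fusion choice here are assumptions for illustration; real systems would use learned encoders such as a CNN and a transformer.

```python
# Illustrative sketch only: toy late-fusion VQA scoring.
# The "features" are hand-picked vectors so the fusion step is easy to follow.

def hadamard_fuse(visual, question):
    """Element-wise (Hadamard) product fusion of two equal-length vectors."""
    assert len(visual) == len(question)
    return [v * q for v, q in zip(visual, question)]

def score_answers(fused, answer_embeddings):
    """Dot-product each candidate answer embedding against the fused vector."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return {name: dot(fused, emb) for name, emb in answer_embeddings.items()}

if __name__ == "__main__":
    visual = [0.9, 0.1, 0.4]    # toy image features
    question = [1.0, 0.0, 0.5]  # toy question embedding
    fused = hadamard_fuse(visual, question)
    answers = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
    scores = score_answers(fused, answers)
    print(max(scores, key=scores.get))  # "cat" scores higher for this toy input
```

The Hadamard product keeps only the feature dimensions that both modalities activate, which is one simple way the "cross-modal semantic representation" described above can be realized.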
