Visual Question Answering Based on Visual Content and Query Enhancement

Abstract

With the rapid development of computer vision and natural language processing, visual question answering (VQA), which sits at the intersection of the two fields, has become a research hotspot. Existing VQA models have made significant progress in general scenarios, but in the presence of data bias, current debiasing methods transfer poorly. Meanwhile, multimodal information enhancement techniques have recently made remarkable progress in exploring and integrating cross-modal semantics; their strong cross-modal representation capabilities offer a new way to address the problems caused by data bias. This paper therefore takes multimodal information enhancement as its starting point and designs a VQA method based on visual content and query enhancement for data-biased scenarios, aiming to improve the model's reasoning ability and overall performance in such contexts.
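To make the cross-modal representation idea above concrete, the following is a minimal, purely illustrative sketch (not the paper's actual method): it fuses a toy visual feature vector with a toy question embedding via an element-wise product, a common fusion step in VQA pipelines, and scores candidate answers against the fused vector. All names, vectors, and the fusion choice here are assumptions for illustration; real systems would use learned encoders such as a CNN and a transformer.

```python
# Illustrative sketch only: toy late-fusion VQA scoring.
# The "features" are hand-picked vectors so the fusion step is easy to follow.

def hadamard_fuse(visual, question):
    """Element-wise (Hadamard) product fusion of two equal-length vectors."""
    assert len(visual) == len(question)
    return [v * q for v, q in zip(visual, question)]

def score_answers(fused, answer_embeddings):
    """Dot-product each candidate answer embedding against the fused vector."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return {name: dot(fused, emb) for name, emb in answer_embeddings.items()}

if __name__ == "__main__":
    visual = [0.9, 0.1, 0.4]    # toy image features
    question = [1.0, 0.0, 0.5]  # toy question embedding
    fused = hadamard_fuse(visual, question)
    answers = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
    scores = score_answers(fused, answers)
    print(max(scores, key=scores.get))  # "cat" scores higher for this toy input
```

The Hadamard product keeps only the feature dimensions that both modalities activate, which is one simple way the "cross-modal semantic representation" described above can be realized.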
