KR-VLM: Enhancing Factual Reasoning in Vision-Language Models via Knowledge Retrieval and Self-Verification

Abstract

Vision-Language Models (VLMs) exhibit powerful capabilities in visual and textual understanding, significantly advancing tasks like Visual Question Answering (VQA). However, hallucination remains a persistent challenge, with VLMs generating responses that are factually inconsistent with the input image or with common sense. This undermines their reliability and trustworthiness, especially in scenarios demanding precise factual reasoning or complex scene comprehension. To address this, we propose KR-VLM (Knowledge-Retrieved Reasoning for Vision-Language Models), a novel approach that enhances VLM factual reasoning and significantly reduces hallucination via knowledge-aware self-supervised learning. KR-VLM integrates a Knowledge Retrieval Module (KRM) to access external facts, a Knowledge Fusion & Calibration Adapter (KFCA) to seamlessly integrate cross-modal knowledge, and a Self-Factual Verification Module (SFVM) to self-correct factual inconsistencies during training. Built on an established VLM architecture, our method is lightweight and requires no extensive human annotation of knowledge or reasoning paths. Extensive experiments on the VQAv2, GQA, OK-VQA, and DocVQA benchmarks show that KR-VLM consistently outperforms state-of-the-art baselines in VQA accuracy and, crucially, achieves a superior Factual Consistency Score, demonstrating its effectiveness in mitigating hallucination.
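To make the three-module pipeline concrete, the sketch below shows one plausible way the components described in the abstract could compose around a VLM backbone. It is a minimal illustration, not the authors' code: the cross-attention fusion in the KFCA, the gating-based calibration, and the contrastive-style consistency loss standing in for the SFVM are all assumptions about the design, and the module shapes are arbitrary.

```python
# Minimal sketch (not the paper's implementation) of how KR-VLM's components
# might fit together. The KRM is assumed to return embeddings of retrieved
# facts; the KFCA fuses them into the backbone's image-text states; the SFVM
# is approximated here by a simple factual-consistency loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeFusionCalibrationAdapter(nn.Module):
    """KFCA sketch: inject retrieved-knowledge embeddings into VLM hidden states."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, 1)   # calibrates how much knowledge to inject per token
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vlm_hidden: torch.Tensor, knowledge_emb: torch.Tensor) -> torch.Tensor:
        # vlm_hidden:    (B, T, d) image-text token states from the backbone
        # knowledge_emb: (B, K, d) embeddings of facts returned by the KRM
        fused, _ = self.cross_attn(vlm_hidden, knowledge_emb, knowledge_emb)
        g = torch.sigmoid(self.gate(vlm_hidden))     # per-token injection weight in [0, 1]
        return self.norm(vlm_hidden + g * fused)


def self_factual_verification_loss(answer_emb: torch.Tensor,
                                   fact_emb: torch.Tensor,
                                   margin: float = 0.2) -> torch.Tensor:
    """SFVM sketch: penalize answer representations that drift from retrieved facts.

    A hinge on cosine similarity is one plausible self-supervised consistency
    signal; the paper may use a different formulation.
    """
    sim = F.cosine_similarity(answer_emb, fact_emb, dim=-1)
    return F.relu(margin - sim).mean()


if __name__ == "__main__":
    B, T, K, d = 2, 16, 4, 64
    adapter = KnowledgeFusionCalibrationAdapter(d_model=d)
    hidden = torch.randn(B, T, d)    # backbone image-text states
    facts = torch.randn(B, K, d)     # KRM output: retrieved-fact embeddings
    fused = adapter(hidden, facts)   # knowledge-aware states used for answer decoding
    loss = self_factual_verification_loss(fused.mean(dim=1), facts.mean(dim=1))
    print(fused.shape, float(loss))
```

The gated residual keeps the adapter lightweight and lets the backbone fall back to its original states when retrieved knowledge is unhelpful, which is consistent with the abstract's claim that the method needs no extensive human annotation; the exact calibration and verification mechanisms are left to the full paper.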
