Towards Multimodal Retrieval-Augmented Generation for Medical Visual Question Answering

Abstract

Medical visual question answering (MedVQA) is a critical AI healthcare task that combines medical image analysis with natural language understanding to assist clinicians in decision-making. While medical vision-language models have shown promise in this domain, they struggle with factual inaccuracies and hallucinations. Retrieval-augmented generation (RAG) improves factual accuracy by grounding responses in external knowledge, yet text-only or image-only retrieval systems fail to capture the inherently multimodal nature of medical data, leading to information loss. This paper proposes a novel multimodal RAG framework tailored for MedVQA, which leverages multimodal data, including medical images, reports, and generated captions, to provide more accurate clinical answers. We introduce a training paradigm that uses captions as auxiliary supervision, enhancing cross-modal alignment via contrastive learning. Comprehensive evaluations on MedVQA benchmarks demonstrate the framework’s effectiveness, achieving a 7% average accuracy improvement over unimodal RAG baselines. Our study has the potential to better support clinicians in delivering accurate, timely, and trustworthy patient care by improving the reliability of MedVQA systems. The code is publicly available at https://github.com/AiMl-hub/MM-RAG-MedVQA.
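For readers unfamiliar with the alignment objective mentioned in the abstract, the sketch below illustrates how caption-supervised contrastive learning is commonly set up: image and caption embeddings are pulled together when they belong to the same study and pushed apart otherwise, via a symmetric InfoNCE (CLIP-style) loss. This is a minimal illustration of the general technique; the function name, tensor shapes, and temperature value are assumptions, not the authors' released implementation (see the linked repository for the actual code).

```python
# Minimal sketch of cross-modal alignment with a symmetric InfoNCE loss.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb: torch.Tensor,
                               caption_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """image_emb, caption_emb: (batch, dim) outputs of the two encoders."""
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)

    # Pairwise similarity matrix; matched image/caption pairs lie on the diagonal.
    logits = image_emb @ caption_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: align images to captions and captions to images.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Random embeddings stand in for encoder outputs in this toy example.
    img = torch.randn(8, 256)
    cap = torch.randn(8, 256)
    print(contrastive_alignment_loss(img, cap).item())
```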
