Towards Multimodal Retrieval-Augmented Generation for Medical Visual Question Answering

Abstract

Medical visual question answering (MedVQA) is a critical AI healthcare task that combines medical image analysis with natural language understanding to assist clinicians in decision-making. While medical vision-language models have shown promise in this domain, they struggle with factual inaccuracies and hallucinations. Retrieval-augmented generation (RAG) improves factual accuracy by grounding responses in external knowledge, yet text-only or image-only retrieval systems fail to capture the inherently multimodal nature of medical data, leading to information loss. This paper proposes a novel multimodal RAG framework tailored for MedVQA, which leverages multimodal data, including medical images, reports, and generated captions, to provide more accurate clinical answers. We introduce a training paradigm that uses captions as auxiliary supervision, enhancing cross-modal alignment via contrastive learning. Comprehensive evaluations on MedVQA benchmarks demonstrate the framework’s effectiveness, achieving a 7% average accuracy improvement over unimodal RAG baselines. Our study has the potential to better support clinicians in delivering accurate, timely, and trustworthy patient care by improving the reliability of MedVQA systems. The code is publicly available at https://github.com/AiMl-hub/MM-RAG-MedVQA.
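For readers unfamiliar with the alignment objective mentioned in the abstract, the sketch below illustrates how caption-supervised contrastive learning is commonly set up: image and caption embeddings are pulled together when they belong to the same study and pushed apart otherwise, via a symmetric InfoNCE (CLIP-style) loss. This is a minimal illustration of the general technique; the function name, tensor shapes, and temperature value are assumptions, not the authors' released implementation (see the linked repository for the actual code).

```python
# Minimal sketch of cross-modal alignment with a symmetric InfoNCE loss.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb: torch.Tensor,
                               caption_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """image_emb, caption_emb: (batch, dim) outputs of the two encoders."""
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)

    # Pairwise similarity matrix; matched image/caption pairs lie on the diagonal.
    logits = image_emb @ caption_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: align images to captions and captions to images.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Random embeddings stand in for encoder outputs in this toy example.
    img = torch.randn(8, 256)
    cap = torch.randn(8, 256)
    print(contrastive_alignment_loss(img, cap).item())
```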
