Enhancing Clinical Reasoning in Medical Vision-Language Model through Structured Prompts

Abstract

Medical Vision-Language Models (MVLMs) are emerging as powerful tools for tasks such as Visual Question Answering (VQA); however, they often struggle with hallucination and limited reasoning transparency, particularly in complex diagnostic scenarios. In this work, we enhance the MedVLM-R1 framework by fine-tuning it with clinically informed prompt structures tailored specifically for radiology-based reasoning. Without altering the original model architecture or training strategy, we redesign the system prompts and question templates to guide the model through structured, modality-aware, step-by-step diagnostic reasoning. Fine-tuning is performed on MRI-based question-answer (QA) pairs, and evaluations are conducted across three diagnostic imaging modalities (MRI, CT, and X-ray) to assess both in-domain and out-of-domain generalization. Our approach improves reasoning transparency and accuracy, achieving 96.00% on MRI, 72.67% on CT, and 75.2% on X-ray. Compared to the original MedVLM-R1, our method closes the gap in MRI accuracy while significantly improving generalization on CT and X-ray. These results demonstrate that clinically grounded prompting improves both reasoning fidelity and robustness across imaging modalities. The code is available at our GitHub repository: https://github.com/aidanbio/AIdanMed
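As an illustration of the kind of structured prompting the abstract describes, the sketch below assembles a modality-aware, step-by-step VQA prompt in Python. The system prompt wording, the <think>/<answer> tag convention, and the build_question_prompt helper are illustrative assumptions rather than the authors' actual templates; the real prompts are in the linked GitHub repository.

```python
# Minimal sketch of a structured, modality-aware prompt for radiology VQA.
# All wording and field names here are illustrative assumptions, not the
# authors' released prompts (see https://github.com/aidanbio/AIdanMed).

SYSTEM_PROMPT = (
    "You are a radiology assistant. Reason step by step: "
    "(1) identify the imaging modality, (2) describe the relevant findings, "
    "(3) relate the findings to the question, (4) state the final answer. "
    "Put your reasoning in <think>...</think> and the answer in <answer>...</answer>."
)

def build_question_prompt(modality: str, question: str, options: list[str]) -> str:
    """Assemble one VQA prompt from a modality tag, a question, and answer options."""
    option_lines = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"Imaging modality: {modality}\n"
        f"Question: {question}\n"
        f"Options:\n{option_lines}\n"
        "Answer with the letter of the correct option."
    )

if __name__ == "__main__":
    # Hypothetical MRI question used only to show the assembled prompt format.
    prompt = build_question_prompt(
        modality="MRI",
        question="Which structure shows abnormal signal intensity?",
        options=["Hippocampus", "Cerebellum", "Thalamus", "Pons"],
    )
    print(SYSTEM_PROMPT)
    print(prompt)
```

A template of this shape can be swapped across modalities (MRI, CT, X-ray) without touching the model or training loop, which is the property the abstract relies on for out-of-domain evaluation.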
