Enhancing Clinical Reasoning in Medical Vision-Language Model through Structured Prompts
Abstract
Medical Vision-Language Models (MVLMs) are emerging as powerful tools for tasks such as Visual Question Answering (VQA); however, they often struggle with hallucination and limited reasoning transparency, particularly in complex diagnostic scenarios. In this work, we enhance the MedVLM-R1 framework by fine-tuning it with clinically informed prompt structures tailored specifically to radiology-based reasoning. Without altering the original model architecture or training strategy, we redesign the system prompts and question templates to guide the model through structured, modality-aware, step-by-step diagnostic reasoning. Fine-tuning is performed on MRI-based question-answer (QA) pairs, and evaluations are conducted across three diagnostic imaging modalities (MRI, CT, and X-ray) to assess both in-domain and out-of-domain generalization. Our approach improves reasoning transparency and accuracy, achieving 96.00% accuracy on MRI, 72.67% on CT, and 75.2% on X-ray. Compared to the original MedVLM-R1, our method closes the gap in MRI accuracy while significantly enhancing generalization performance on the CT and X-ray modalities. These results demonstrate that clinically grounded prompting effectively improves both reasoning fidelity and robustness across imaging modalities. The code is available at our GitHub repository: https://github.com/aidanbio/AIdanMed
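To make the prompting idea concrete, the sketch below shows what a structured, modality-aware question template of this kind might look like. It is a minimal illustration only: the exact system prompt, question wording, and answer tags used in the paper are defined in the linked repository, and every string, function name, and tag here (e.g. `SYSTEM_PROMPT`, `build_question_prompt`, the `<think>`/`<answer>` markers) is an assumption for demonstration, not the authors' implementation.

```python
# Hypothetical structured prompt template in the spirit of the approach described
# above: identify the modality, reason step by step, then commit to one answer.
# All wording and tag names are illustrative assumptions.

SYSTEM_PROMPT = (
    "You are a radiology assistant. Reason step by step:\n"
    "1. Identify the imaging modality (MRI, CT, or X-ray).\n"
    "2. Describe the relevant findings in the image.\n"
    "3. Relate the findings to each answer option.\n"
    "4. State the single best answer.\n"
    "Put your reasoning inside <think>...</think> and the final choice inside <answer>...</answer>."
)

def build_question_prompt(question: str, options: list[str]) -> str:
    """Format a VQA question and its options into the structured template."""
    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"{question}\n"
        f"Options:\n{option_block}\n"
        "Respond with the option letter only inside <answer>...</answer>."
    )

if __name__ == "__main__":
    # Example usage with a hypothetical MRI question.
    print(SYSTEM_PROMPT)
    print(build_question_prompt(
        "Which abnormality is most consistent with the MRI shown?",
        ["Glioblastoma", "Meningioma", "Normal study"],
    ))
```

Under these assumptions, the same template is reused verbatim at evaluation time for CT and X-ray questions, which is what allows the out-of-domain comparison to isolate the effect of the prompt structure rather than the prompt wording per modality.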