A Vision-Language Model with Multi-Granular Knowledge Fusion in Medical Imaging
Abstract
The rapid expansion of radiological imaging data has placed a significant burden on radiologists, increasing the risk of diagnostic errors. Vision-language models offer a promising way to alleviate this workload and improve diagnostic accuracy in the medical imaging domain. However, most current models rely solely on their training data, which often leads to inadequate understanding and low-quality outputs in complex, specialized medical scenarios where domain knowledge is insufficient. To address this limitation, we propose a Vision-Language Model with Multi-Granular Knowledge Fusion (MGKF) that integrates diverse sources of knowledge to enhance performance across medical imaging tasks. Our model dynamically incorporates multi-granular knowledge, including medical entities, their definitions, and retrieved auxiliary knowledge. We improve the semantic alignment of visual and textual information through fine-tuning, introduce a pre-generation mechanism that injects this multi-granular knowledge, and thereby enhance the model's ability to apply medical knowledge during inference. Experimental results on multiple medical imaging tasks, including Medical Report Generation, Medical Image Captioning, and Medical Visual Question Answering, demonstrate the effectiveness of the proposed MGKF model. This work offers insights into the integration of specialized knowledge in medical imaging and contributes to reducing diagnostic errors.
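The abstract does not specify how the three granularities of knowledge are combined before generation. Purely as an illustration, and not the authors' published implementation, the sketch below shows one plausible pre-generation fusion step: assembling detected medical entities, their definitions, and retrieved auxiliary passages into a single knowledge-augmented text input for a vision-language generator. The function name, prompt layout, and example content are all assumptions.

```python
# Illustrative sketch only; the paper's actual fusion mechanism is not
# described at this level of detail. This shows one plausible way the three
# knowledge granularities named in the abstract (entities, definitions,
# retrieved auxiliary knowledge) could be fused into a generator's input.

def fuse_multigranular_knowledge(entities, definitions, retrieved, question):
    """Assemble a knowledge-augmented text prompt from three granularities."""
    parts = []
    # Granularity 1: medical entities detected for the image.
    if entities:
        parts.append("Entities: " + ", ".join(entities))
    # Granularity 2: definitions of those entities (e.g. from a medical
    # ontology -- the ontology choice is a hypothetical detail).
    defs = [f"{e}: {definitions[e]}" for e in entities if e in definitions]
    if defs:
        parts.append("Definitions: " + " | ".join(defs))
    # Granularity 3: retrieved auxiliary knowledge passages.
    if retrieved:
        parts.append("Context: " + " ".join(retrieved))
    parts.append("Question: " + question)
    return "\n".join(parts)

prompt = fuse_multigranular_knowledge(
    entities=["cardiomegaly"],
    definitions={"cardiomegaly": "abnormal enlargement of the heart"},
    retrieved=["Cardiomegaly is often assessed via the cardiothoracic ratio."],
    question="Is the heart size normal?",
)
print(prompt)
```

Under this reading, the fused prompt would be paired with the image features and handed to the decoder, so the model conditions on domain knowledge before generating a report or answer.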