Bridging Medical Imaging and Reports: Learning Radiologist's Nuances via Fine-Grained Multi-Modal Alignment


Abstract

Precise and explainable alignment between different data modalities is crucial for advancing artificial general intelligence in medicine. In this work, we present CAMMAL (Cyclic Adaptive Medical Modality ALignment), a framework that achieves fine-grained vision-language alignment through two key innovations: an Adaptive Patch-Word Matching (AdaMatch) mechanism that dynamically correlates regions in medical images with specific words in radiology reports, and a bidirectional generative architecture that leverages the alignment between textual and visual codebooks to guide translation between modalities within a single model. Evaluation of CAMMAL on chest X-rays (combined MIMIC-CXR and OpenI datasets) and mammography (EMBED dataset) demonstrates superior performance over other methods across multiple metrics. Human reader studies by radiologists validate the clinical effectiveness of the generated reports and synthetic images, particularly in capturing anatomical structures (73% rated good/excellent) and meaningful findings (56% rated very good/excellent). By systematically capturing inter-modality correspondence in medical data and enabling fluid multi-way translation between modalities, CAMMAL bridges the gap between medical images and textual descriptions and could advance the development of interpretable and clinically reliable AI systems. Its ability to accurately retrieve, generate, and align reports with imaging data offers potential improvements in capturing radiologists' detailed knowledge, which could be vital for AI-assisted diagnosis and medical education. The bidirectional generative approach further enhances model transparency and explainability, fostering trust in AI-driven healthcare applications. As multimodal medical foundation models continue to evolve, CAMMAL provides a foundation for integrating vision and language understanding, with the potential for broader clinical applications beyond radiology.
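
To make the patch-word matching idea more concrete, the sketch below computes a soft alignment between image-patch embeddings and report-token embeddings via temperature-scaled cosine similarity. This is a minimal, hypothetical PyTorch example: the function name, embedding dimensions, and temperature are assumptions for illustration and are not taken from the CAMMAL implementation, which the abstract does not detail.

```python
import torch
import torch.nn.functional as F

def adaptive_patch_word_matching(patch_embeds, word_embeds, temperature=0.07):
    """Illustrative patch-word matching (assumed form, not the paper's code).

    patch_embeds: (num_patches, dim) embeddings of image regions/patches
    word_embeds:  (num_words, dim)   embeddings of report tokens
    Returns a (num_words, num_patches) soft alignment matrix in which each
    word distributes attention over the image patches it best matches.
    """
    # Cosine similarity via L2-normalized dot products
    patch_embeds = F.normalize(patch_embeds, dim=-1)
    word_embeds = F.normalize(word_embeds, dim=-1)
    sim = word_embeds @ patch_embeds.T
    # Temperature-scaled softmax over patches for each word
    return F.softmax(sim / temperature, dim=-1)

# Usage: e.g. 196 ViT patches of a chest X-ray vs. 32 report tokens, 512-d embeddings
patches = torch.randn(196, 512)
words = torch.randn(32, 512)
alignment = adaptive_patch_word_matching(patches, words)
print(alignment.shape)  # torch.Size([32, 196])
```

In an actual training setup, such an alignment matrix would typically be learned jointly with the encoders so that clinically meaningful words concentrate their weight on the corresponding image regions, which is the behavior the abstract attributes to AdaMatch.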
