A Systematic Review and Implementation Guidelines of Multimodal Foundation Models in Medical Imaging
Abstract
Artificial Intelligence (AI) holds immense potential to transform healthcare, yet progress is often hindered by the reliance on large labeled datasets and unimodal data. Multimodal Foundation Models (FMs), particularly those leveraging Self-Supervised Learning (SSL) on multimodal data, offer a paradigm shift towards label-efficient, holistic patient modeling. However, the rapid emergence of these complex models has created a fragmented landscape. Here, we provide a systematic review of multimodal FMs for medical imaging applications. Through rigorous screening of 1,144 publications (2012–2024) and in-depth analysis of 48 studies, we establish a unified terminology and comprehensively assess the current state-of-the-art. Our review aggregates current knowledge, critically identifies key limitations and underexplored opportunities, and culminates in actionable guidelines for researchers, clinicians, developers, and policymakers. This work provides a crucial roadmap to navigate and accelerate the responsible development and clinical translation of next-generation multimodal AI in healthcare.
