A State-of-the-Art Review on Multimodal Deep Learning for Medical Diagnosis: Integrating Text and Image

Abstract

Multimodal deep learning has become a prominent paradigm in computer-assisted medical diagnosis because it integrates diverse data sources such as medical images, text, physiological signals, and electronic health records. Traditional single-modal diagnostic systems often fail to handle the complex, interdependent nature of medical information, which leads to suboptimal diagnostic performance, low robustness, and poor generalizability across disease domains. In contrast, multimodal learning systems use supplementary information from other modalities to build more abstract feature representations, and so better support accurate clinical decision-making. This review provides a comprehensive, systematic treatment of recent developments in multimodal deep learning for medical diagnosis, covering work published from 2021 to 2025. It focuses on six major disease domains: Brain and Neurological Disorders, Lung and Respiratory Diseases, Cancer and Oncology, Cardiovascular Diseases, Infectious Diseases, and General Multimodal Diagnostic Frameworks. Across these domains, it considers the types of data used (imaging, text, audio, genetic, and time-series clinical data) together with data preprocessing methods, fusion architectures, evaluation metrics, and explainability tools. The results show a significant shift from conventional convolutional neural network-based and late fusion-based architectures to transformer-based and attention-driven models, which are adept at aligning representations across modalities. The growing availability of self-supervised learning, synthetic data augmentation, and multimodal large language models has significantly improved overall performance, especially where clinical data are limited.
However, several challenges persist in multimodal deep learning, including the limited availability of multimodal datasets, the complexity of modality fusion, spatio-temporal misalignment, and the trade-off between interpretability and accuracy. This review highlights the potential of multimodal deep learning to improve the accuracy, robustness, and clinical relevance of medical diagnostics across applications, and it identifies research directions needed for widespread, interpretable use in clinical practice.
