Multi-Modal Deep Learning Analysis: Review and Applications

Abstract

Multi-modal Deep Learning (MMDL) represents a significant advancement in artificial intelligence (AI). It combines different modalities, such as text, images, audio, and sensor data, to build systems that mimic human cognitive abilities in perception, reasoning, and decision-making. This paper reviews MMDL and classifies its main challenges into five categories: representation, alignment, fusion, co-learning, and translation. The field addresses important issues such as cross-modal representation learning, temporal and structural alignment, and multi-modal fusion, all aimed at improving the robustness and interpretability of decision-making. A bibliometric analysis identifies key research trends and applications across various domains. The study then explores MMDL applications in depth, covering healthcare and medical imaging, autonomous systems, natural language processing, environmental monitoring, social media analysis, and mining. These applications demonstrate a growing reliance on multi-modal architectures to improve predictive accuracy and decision support. Notably, healthcare has seen significant developments in disease diagnosis and medical image interpretation through multi-modal fusion, while autonomous systems leverage cross-modal learning for perception and navigation. Advances in encoder-decoder frameworks and cross-modal correlation modelling have led to significant progress in tasks such as visual question answering, medical diagnostics, and sentiment analysis. Nonetheless, the mining industry has not yet fully explored the potential applications of this technology, revealing a considerable research gap that warrants further investigation. This paper aims to serve as a foundational resource for advancing research in MMDL and its applications.