A Comprehensive Survey of Multimodal Large Language Models: Concept, Application and Safety

Abstract

Recent advances in Multimodal Large Language Models (MLLMs), exemplified by developments such as GPT-4o, have positioned them as a significant focus within the research community. MLLMs leverage the general capabilities of Large Language Models (LLMs) to handle tasks across multiple modalities, including text, image, audio, and video. With their unique ability to understand and generate content, such as composing narratives from visual inputs, MLLMs are attracting substantial interest from both academia and industry. However, the rapid proliferation of MLLM algorithms and techniques has given rise to new architectures, applications, and safety issues. This survey aims to document and analyze these latest advancements comprehensively. First, we introduce the fundamental concepts of MLLMs, including the development history of multimodal algorithms, the architecture of MLLMs, and their evaluation and benchmarks. We then explore advanced techniques in MLLMs, such as Multimodal In-Context Learning, Multimodal Chain of Thought, and LLM-aided Visual Reasoning. Following this, we examine the safety aspects of MLLMs, focusing on security issues, potential attacks, and model safety assessments. Finally, we discuss the current challenges and identify potential areas for future research.