Efficient GPT-4V Level Multimodal Large Language Model for Deployment on Edge Devices
Abstract
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain that prevent MLLMs from being practical in real-world applications. The most notable challenge is the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs must be deployed on high-performance cloud servers, which greatly limits their application scope in mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on edge devices. By integrating the latest MLLM techniques in architecture, pre-training, and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks; (2) strong OCR capability and 1.8M-pixel high-resolution image perception at any aspect ratio; (3) trustworthy behavior with low hallucination rates; (4) multilingual support for over 30 languages; and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend (Fig. 1): the model sizes needed to achieve usable (e.g., GPT-4V) level performance are rapidly decreasing, while edge computation capacity is growing fast. Together, these trends show that GPT-4V level MLLMs deployed on edge devices are becoming increasingly feasible, unlocking a wider spectrum of real-world AI applications in the near future.
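For readers who want a concrete sense of how such a model is used, the sketch below shows one way single-image chat inference might be run with the Hugging Face transformers library. The checkpoint identifier openbmb/MiniCPM-Llama3-V-2_5, the trust_remote_code custom code path, and the chat() call signature are assumptions drawn from the model's public release, not details stated in this abstract; this is an illustrative sketch, not the paper's own deployment recipe.

```python
# Minimal, illustrative sketch (assumes the publicly released Hugging Face
# checkpoint "openbmb/MiniCPM-Llama3-V-2_5" and its custom chat() interface).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-Llama3-V-2_5"
# The model ships custom modeling code, hence trust_remote_code=True.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  torch_dtype=torch.float16).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")  # any-aspect-ratio input image
msgs = [{"role": "user", "content": "Describe the text in this image."}]

# chat() is the generation entry point exposed by the model's remote code.
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                    sampling=True, temperature=0.7)
print(answer)
```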