Benchmark Evaluation of Multi-Modal Large Language Models for Ophthalmic Diagnosis
Abstract
Multi-modal large language models (MLLMs) are demonstrating increasing potential in medical applications, particularly in image-intensive specialties such as ophthalmology. While state-of-the-art models such as ChatGPT-4o and Qwen-VL 2.5 perform impressively on general-domain tasks, real-world clinical benchmark datasets for rigorously evaluating their diagnostic capabilities in specialized medical contexts remain scarce. To address this gap, we constructed a curated benchmark dataset of 295 pathologically confirmed ophthalmic cases with representative clinical presentations. Using this dataset, we systematically evaluated nine leading MLLMs, both open-source and proprietary. Our results show that models such as HAIBU-REMUD, ChatGPT-4o, and Gemini 2.5 achieve high diagnostic accuracy and strong consistency, approaching the performance of human experts. These findings suggest that current MLLMs are reaching a stage of practical applicability to real-world clinical settings, laying the groundwork for their integration into ophthalmic practice.
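To make the scoring concrete, the sketch below shows one plausible way to compute the two quantities the abstract reports, diagnostic accuracy against the pathologically confirmed label and run-to-run consistency. The case format, the `query_model` wrapper, and the majority-vote consistency measure are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
from collections import Counter

# Hypothetical case record: an image, clinical notes, and the
# pathologically confirmed diagnosis used as ground truth.
cases = [
    {"image": "case_001.png", "notes": "blurred vision, 63F", "label": "wet AMD"},
    # ... one record per curated benchmark case
]

def query_model(model, image, notes):
    """Placeholder for a single MLLM call returning a diagnosis string."""
    raise NotImplementedError  # wire up the real model API client here

def evaluate(model, cases, n_runs=3):
    """Score top-1 diagnostic accuracy and run-to-run consistency.

    Accuracy: fraction of cases whose majority answer across n_runs
    queries matches the confirmed label. Consistency: mean fraction of
    the n_runs answers agreeing with the majority (1.0 = fully stable).
    """
    correct, consistency = 0, 0.0
    for case in cases:
        answers = [query_model(model, case["image"], case["notes"])
                   for _ in range(n_runs)]
        majority, count = Counter(answers).most_common(1)[0]
        correct += int(majority == case["label"])
        consistency += count / n_runs
    n = len(cases)
    return {"accuracy": correct / n, "consistency": consistency / n}
```

Under this reading, a model can be accurate yet inconsistent (or vice versa), which is why the abstract treats the two properties as separate evidence of clinical readiness.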