Benchmark Evaluation of Multi-Modal Large Language Models for Ophthalmic Diagnosis

Abstract

Multi-modal large language models (MLLMs) are demonstrating significant potential in medical applications, particularly in image-intensive specialties such as ophthalmology. While state-of-the-art models such as ChatGPT-4o and Qwen-VL 2.5 perform impressively on general-domain tasks, real-world clinical benchmark datasets for rigorously evaluating their diagnostic capabilities in specialized medical contexts remain scarce. To address this gap, we constructed a curated benchmark dataset comprising 295 pathologically confirmed ophthalmic cases with representative clinical presentations. Using this dataset, we conducted a systematic evaluation of nine leading MLLMs, both open-source and proprietary. Our results show that models such as HAIBU-REMUD, ChatGPT-4o, and Gemini 2.5 achieve high diagnostic accuracy and strong consistency, with performance approaching that of human experts. These findings suggest that current MLLMs have reached a promising level of applicability to real-world clinical settings, laying the groundwork for their integration into ophthalmology practice.
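The abstract describes the evaluation only at a high level. As a rough illustration of how such a benchmark might be scored, the Python sketch below computes per-model diagnostic accuracy and answer consistency over a set of cases. The file name `cases.jsonl`, its fields, the `query_model` stub, and the repeated-query definition of consistency are all assumptions made for demonstration; the paper does not specify these details or publish its evaluation code.

```python
import json
from collections import Counter

# Illustrative sketch only: the dataset format, model interface, and
# consistency metric below are assumptions, not the authors' protocol.

N_REPEATS = 3  # hypothetical number of repeated queries per case


def query_model(model_name: str, case: dict) -> str:
    """Placeholder for an MLLM call: given a case's image and clinical
    description, return a predicted diagnosis string."""
    raise NotImplementedError("Wire this to the model API under evaluation.")


def evaluate(model_name: str, cases: list[dict]) -> dict:
    """Score one model: accuracy = majority answer vs. ground truth;
    consistency = mean fraction of repeats agreeing with the majority."""
    correct = 0
    agreement = 0.0
    for case in cases:
        answers = [query_model(model_name, case) for _ in range(N_REPEATS)]
        majority, count = Counter(answers).most_common(1)[0]
        if majority == case["diagnosis"]:  # assumes a 'diagnosis' field
            correct += 1
        agreement += count / N_REPEATS
    return {
        "model": model_name,
        "accuracy": correct / len(cases),
        "consistency": agreement / len(cases),
    }


if __name__ == "__main__":
    # Assumes one JSON object per line, e.g.
    # {"image": "...", "history": "...", "diagnosis": "..."}
    with open("cases.jsonl") as f:
        cases = [json.loads(line) for line in f]
    print(evaluate("example-mllm", cases))
```

Under these assumptions, each of the nine models would be run through the same loop over the 295 cases, making the accuracy and consistency figures directly comparable across models.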
