Benchmark Evaluation of Multi-Modal Large Language Models for Ophthalmic Diagnosis
Abstract
Multi-modal large language models (MLLMs) are demonstrating increasing potential in medical applications, particularly in image-intensive specialties such as ophthalmology. While state-of-the-art models such as ChatGPT-4o and Qwen-VL 2.5 perform impressively on general-domain tasks, real-world clinical benchmark datasets for rigorously evaluating their diagnostic capabilities in specialized medical contexts remain scarce. To address this gap, we constructed a curated benchmark dataset of 295 pathologically confirmed ophthalmic cases with representative clinical presentations. Using this dataset, we systematically evaluated nine leading MLLMs, both open-source and proprietary. Our results show that models such as HAIBU-REMUD, ChatGPT-4o, and Gemini 2.5 achieve high diagnostic accuracy and strong consistency, approaching the performance of human experts. These findings suggest that current MLLMs are reaching a stage of practical applicability to real-world clinical settings, laying the groundwork for their integration into ophthalmic practice.
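To make the scoring concrete, the sketch below shows one plausible way to compute the two quantities the abstract reports, diagnostic accuracy against the pathologically confirmed label and run-to-run consistency. The case format, the `query_model` wrapper, and the majority-vote consistency measure are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
from collections import Counter

# Hypothetical case record: an image, clinical notes, and the
# pathologically confirmed diagnosis used as ground truth.
cases = [
    {"image": "case_001.png", "notes": "blurred vision, 63F", "label": "wet AMD"},
    # ... one record per curated benchmark case
]

def query_model(model, image, notes):
    """Placeholder for a single MLLM call returning a diagnosis string."""
    raise NotImplementedError  # wire up the real model API client here

def evaluate(model, cases, n_runs=3):
    """Score top-1 diagnostic accuracy and run-to-run consistency.

    Accuracy: fraction of cases whose majority answer across n_runs
    queries matches the confirmed label. Consistency: mean fraction of
    the n_runs answers agreeing with the majority (1.0 = fully stable).
    """
    correct, consistency = 0, 0.0
    for case in cases:
        answers = [query_model(model, case["image"], case["notes"])
                   for _ in range(n_runs)]
        majority, count = Counter(answers).most_common(1)[0]
        correct += int(majority == case["label"])
        consistency += count / n_runs
    n = len(cases)
    return {"accuracy": correct / n, "consistency": consistency / n}
```

Under this reading, a model can be accurate yet inconsistent (or vice versa), which is why the abstract treats the two properties as separate evidence of clinical readiness.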