How Far Have Large Language Models Advanced in Ophthalmology? A Systematic Review of Their Development, Evaluation, and Readiness for Clinical Use

Abstract

Large language models (LLMs) are rapidly transforming ophthalmology, with expanding applications in patient care, clinical documentation, and medical education. Recent studies span a wide range of use cases, from early text-only applications to emerging multimodal systems that integrate ophthalmic images to support diagnosis and generate assessment and treatment plans. Amid this rapid progress, it is critical for both researchers and clinicians to stay informed in order to guide responsible development and adoption. However, prior reviews have largely focused on narrow areas, such as cataloguing potential use cases or benchmarking performance on board-style examinations, leaving the broader landscape insufficiently characterized. Key questions remain unanswered: How are LLMs in ophthalmology being developed? What applications and evaluation strategies are being pursued? And which areas are closest to real-world clinical adoption? To date, these aspects have not been comprehensively examined. In this study, we conducted a systematic review of LLMs in ophthalmology by manually screening 1,029 studies from PubMed/PMC, Scopus, and Embase published between January 1, 2022, and April 1, 2025, identifying 91 relevant articles. To provide a standardized assessment, we introduced a structured framework that categorizes ophthalmic use cases and stratifies evaluation rigor across five levels of maturity. Each study was manually annotated using 27 structured variables spanning multiple dimensions: scope and purpose (e.g., study aim, ophthalmic subspecialty, input modality); model architecture and training (e.g., backbone LLMs, domain-specific adaptations); evaluation and validation (e.g., target applications, evaluation metrics, level of clinical validation); and resource availability (e.g., model access, licensing, dataset availability). We additionally performed a small-scale, illustrative evaluation of representative emerging models, such as GPT-5.2, gpt-oss-120B, and Gemini 3, to contextualize previously reported results on commonly used ophthalmology tasks. The results show that most studies focused on general-purpose proprietary models, such as GPT-4 and Gemini, while fewer than 10% introduced domain-specific adaptations for ophthalmology, and only 4% developed ophthalmology-specific architectures for text-based applications. Multimodal LLMs remain relatively underexplored, with only 23% of studies incorporating imaging data. Evaluation practices reveal a significant translational gap: while 57.1% of studies relied on standard benchmarking and expert review, only 9.9% conducted retrospective validation using real-world clinical data, and just two studies progressed to prospective pilot evaluation. Moreover, although model performance on benchmarks such as board-style exams and clinical vignettes has improved with newer model generations, reproducibility and transparency remain limited: only 5.5% of studies released evaluation code, and only 33% used publicly available datasets. Finally, we provide a living repository to track the rapid progress of LLMs in ophthalmology for the broader research and clinical community.
