Assessing The Performance of Multimodal Large Language Models in Diagnosing and Staging Diabetic Retinopathy: An External Validation Study of Large Language Models

Abstract

Diabetic retinopathy (DR) is a leading cause of visual impairment, requiring effective and scalable screening tools for early detection. Existing methods are complex, expensive, and reliant on specialized personnel, limiting their use in primary care. This external validation study evaluates the potential of multimodal large language models (LLMs) for detecting DR, staging DR, and identifying diabetic maculopathy, using 228 fundus images captured at Tuanku Ampuan Najihah Hospital. The models evaluated were GPT-4, Google’s Gemini 1.5, Anthropic’s Claude 3 Haiku, and Mistral Large. Sensitivity, specificity, and predictive values were assessed, and results were validated against human ophthalmologist evaluations. GPT-4 achieved good sensitivity for detecting DR (82%) and referable DR (80%), meeting UK NICE criteria. However, all LLMs, including GPT-4, performed poorly in staging DR and detecting diabetic maculopathy. While GPT-4 shows promise in identifying DR, its limitations in detailed DR staging and maculopathy detection highlight the need for cautious implementation.

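The abstract reports sensitivity, specificity, and predictive values for binary screening decisions (any DR, referable DR). The sketch below is not the authors' analysis pipeline; it is a minimal illustration, with hypothetical confusion-matrix counts, of how these screening metrics are derived from true/false positives and negatives and compared against the 80% sensitivity benchmark referenced in the abstract.

```python
# Minimal sketch (not the study's code): screening metrics for a binary
# referable-DR classification task. All counts are hypothetical placeholders.

def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=82, fp=20, tn=108, fn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")

# The abstract cites a UK NICE-based benchmark of at least 80% sensitivity
# for referable DR; a model's sensitivity would be checked against it like so:
print("meets 80% sensitivity benchmark:", metrics["sensitivity"] >= 0.80)
```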