The Role of Prompt Engineering for Multimodal LLM Glaucoma Diagnosis

Reem Agbareia
Mahmud Omar
Ofira Zloto
Nisha Chandala
Tania Tai
Benjamin S Glicksberg
Girish N Nadkarni
Eyal Klang

Abstract

Background and Aim

This study evaluates the diagnostic performance of multimodal large language models (LLMs), GPT-4o and Claude Sonnet 3.5, in detecting glaucoma from fundus images. We specifically assess the impact of prompt engineering and the use of reference images on model performance.

Methods

We used the public ACRIMA dataset, comprising 705 labeled fundus images, and designed four prompt types, ranging from simple instructions to more refined prompts with reference images. The two models were tested across 5640 API runs, with accuracy, sensitivity, specificity, PPV, and NPV assessed through non-parametric statistical tests.
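As an illustration of the quantities named above, the following minimal Python sketch computes the standard binary diagnostic metrics from confusion-matrix counts and checks the run count implied by the design (705 images, four prompt types, two models). It is not the authors' code: the `ConfusionCounts` class, the `diagnostic_metrics` function, and the example counts are hypothetical and chosen only for illustration. Glaucoma is assumed to be the positive class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary classifier; glaucoma is assumed to be the positive class."""
    tp: int  # glaucoma images predicted as glaucoma
    fp: int  # normal images predicted as glaucoma
    tn: int  # normal images predicted as normal
    fn: int  # glaucoma images predicted as normal


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Compute the standard metrics from confusion-matrix counts."""
    return {
        "accuracy": ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "sensitivity": ratio(c.tp, c.tp + c.fn),  # recall on glaucoma images
        "specificity": ratio(c.tn, c.tn + c.fp),  # recall on normal images
        "ppv": ratio(c.tp, c.tp + c.fp),          # positive predictive value
        "npv": ratio(c.tn, c.tn + c.fn),          # negative predictive value
        "f1": ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }


# Run count implied by the study design: images x prompt types x models.
N_IMAGES, N_PROMPT_TYPES, N_MODELS = 705, 4, 2
total_runs = N_IMAGES * N_PROMPT_TYPES * N_MODELS

# Made-up counts for illustration only; these are NOT results from the study.
example = ConfusionCounts(tp=80, fp=30, tn=70, fn=20)
metrics = diagnostic_metrics(example)

if __name__ == "__main__":
    print(f"total API runs: {total_runs}")
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

The product 705 × 4 × 2 reproduces the 5640 runs reported in the Methods. In practice, one set of counts would be tallied per model and prompt type, so that the effect of each prompt design can be compared metric by metric.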

Results

Claude Sonnet 3.5 achieved a peak sensitivity of 94.92%, a specificity of 73.46%, and an F1 score of 0.726. GPT-4o reached a peak sensitivity of 81.47%, a specificity of 50.49%, and an F1 score of 0.645. Prompt engineering combined with reference images improved GPT-4o’s accuracy by 39.8% and Claude Sonnet 3.5’s by 64.2%, significantly enhancing both models’ performance.

Conclusion

Multimodal LLMs demonstrated potential in diagnosing glaucoma, with Claude Sonnet 3.5 achieving a sensitivity of 94.92%, far exceeding the 22% sensitivity reported for primary care physicians in the literature. Prompt engineering, especially with reference images, significantly improved diagnostic performance. As LLMs become more integrated into medical practice, efficient prompt design may be key, and training doctors to use these tools effectively could enhance clinical outcomes.

Version published to 10.1101/2024.10.30.24316434v1 on medRxiv
Nov 1, 2024

Related articles

Multimodal LLMs for Retinal Disease Diagnosis via OCT: Few-Shot vs Single-Shot Learning

This article has 6 authors:
1. Reem Agbareia
2. Mahmud Omar
3. Ofira Zloto
4. Benjamin S Glicksberg
5. Girish N Nadkarni
6. Eyal Klang
This article has no evaluations. Latest version: Nov 4, 2024

A Comparative Study on Deep Convolutional Neural Networks and Histogram Equalization Techniques for Glaucoma Detection From Fundus Images

This article has 2 authors:
1. Ashish Kulkarni
2. H Shafeeq Ahmed
This article has no evaluations. Latest version: Oct 30, 2024

Efficacy of lightweight Vision Transformers in diagnosis of pneumonia

This article has 1 author:
1. Muhammad Tayyeb Bukhari
This article has no evaluations. Latest version: Oct 24, 2024
