Exploring AI’s Potential in Papilledema Diagnosis to Support Dermatological Treatment Decisions in Rural Healthcare
This article has been reviewed by the following groups:
- Evaluated articles (PREreview)
Abstract
Background: Papilledema, an ophthalmic finding associated with increased intracranial pressure, can be induced by dermatological medications, including corticosteroids, isotretinoin, and tetracyclines. Early detection is crucial for preventing irreversible optic nerve damage, but access to ophthalmologic expertise is often limited in rural settings. Artificial intelligence (AI) may enable automated, accurate detection of papilledema from fundus images, thereby supporting timely diagnosis and management.

Objective: The primary objective of this study was to explore the diagnostic capability of ChatGPT-4o, a general large language model with multimodal input, in identifying papilledema from fundus photographs. For context, its performance was compared with a ResNet-based convolutional neural network (CNN) specifically fine-tuned for ophthalmic imaging, as well as with the assessments of two human ophthalmologists. The focus was on applications relevant to dermatological care in resource-limited environments.

Methods: A dataset of 1094 fundus images (295 papilledema, 799 normal) was preprocessed and partitioned into training and test sets. The ResNet model was fine-tuned using discriminative learning rates and a one-cycle learning rate policy. GPT-4o and two human evaluators (a senior ophthalmologist and an ophthalmology resident) independently assessed the test images. Diagnostic metrics, including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and Cohen's Kappa, were calculated for each evaluator.

Results: GPT-4o, when applied to papilledema detection, achieved an overall accuracy of 85.9% with substantial agreement beyond chance (Cohen's Kappa = 0.72), but lower specificity (78.9%) and positive predictive value (73.7%) compared to the benchmark models. For context, the ResNet model, fine-tuned for ophthalmic imaging, reached near-perfect accuracy (99.5%, Kappa = 0.99), while the two human ophthalmologists achieved accuracies of 96.0% (Kappa ≈ 0.92).

Conclusions: This study explored the capability of GPT-4o, a large language model with multimodal input, for detecting papilledema from fundus photographs. GPT-4o achieved moderate diagnostic accuracy and substantial agreement with the ground truth, but it underperformed compared to both a domain-specific ResNet model and human ophthalmologists. These findings underscore the distinction between generalist large language models and specialized diagnostic AI: while GPT-4o is not optimized for ophthalmic imaging, its accessibility, adaptability, and rapid evolution highlight its potential as a future adjunct in clinical screening, particularly in underserved settings. These findings also underscore the need for validation on external datasets and in real-world clinical environments before such tools can be broadly implemented.
Article activity feed
This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/16949097.
Exploring AI's Potential in Papilledema Diagnosis to Support Dermatological Treatment Decisions in Rural Healthcare
Brief summary of the study
This study evaluates the potential of AI to detect papilledema from fundus photographs in the context of dermatological treatment decisions in rural healthcare. The authors compared a fine-tuned ResNet CNN and GPT-4o against two human ophthalmologists using 1,389 fundus images. The ResNet model achieved the highest performance, with 99.49% accuracy and 100% specificity, surpassing both the human experts (95.96% accuracy) and GPT-4o (85.86% accuracy). These findings highlight AI's promise for supporting early papilledema detection, particularly in resource-limited settings where specialist access is scarce.
The study situates its contribution within existing literature on AI in ophthalmology and dermatology, extending it by focusing on drug-induced papilledema risk in dermatology patients. The authors conclude that while AI models, especially ResNet, show strong diagnostic potential, validation on diverse real-world datasets remains necessary. The most interesting aspect is the demonstration that AI can outperform specialists in a critical, vision-threatening condition, offering tangible benefits for healthcare equity in underserved areas.
Major comments
Comments on the strengths of the methods employed and the discussion
The test set included only papilledema and normal cases, while pseudo-papilledema (a clinically important mimic) was excluded. This reduces clinical realism. It would be better to state this limitation clearly in the abstract and discussion and, if feasible, include pseudo-papilledema in future test sets to reflect real-world diagnostic challenges.
GPT-4o's training exposure to ophthalmic images is unknown, limiting the interpretability of its results. Please clarify in the methods that GPT-4o was treated as a "black-box comparator," and emphasize in the discussion that its lower performance should not discredit LLMs broadly but rather highlights the need for domain-specific fine-tuning.
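As an illustration, the following is a minimal sketch of how such a black-box comparison might be run, assuming the OpenAI Python SDK and a simple single-word prompt; the authors' actual prompt, model settings, and pipeline are not reported in this review.

```python
# Hypothetical sketch of a black-box GPT-4o evaluation, assuming the
# OpenAI Python SDK; the authors' actual prompt and settings are not reported.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_fundus(image_path: str) -> str:
    """Ask GPT-4o for a binary papilledema/normal call on one fundus image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this fundus photograph as 'papilledema' "
                         "or 'normal'. Answer with a single word."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        temperature=0,  # minimize sampling variability across test images
    )
    return response.choices[0].message.content.strip().lower()
```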
While the dataset is well described, it is not openly available, and the source code is not shared. This limits reproducibility and independent validation. Please provide a code repository (e.g., GitHub, Zenodo) with preprocessing scripts and model training details. If full data sharing is not possible due to ethics, consider providing a de-identified subset or synthetic dataset for benchmarking.
Minor comments
Comments on interpretation of the results, presentation of the data/figures
The confusion matrices could be clearer, and the fundus images are of relatively low resolution. It would be better to add explicit "true positive/false negative" labels to the confusion matrices and to provide higher-resolution fundus images.
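As a concrete illustration of the labeling suggestion, explicit quadrant roles can be overlaid with scikit-learn and matplotlib; this is a generic sketch on toy labels, not the authors' plotting code.

```python
# Generic sketch: an explicitly labeled confusion matrix with scikit-learn
# and matplotlib (toy labels, not the study data).
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_true = [1, 1, 0, 0, 1, 0]  # toy ground truth: 1 = papilledema, 0 = normal
y_pred = [1, 0, 0, 0, 1, 1]  # toy predictions

# Put the positive class (papilledema) first so the top-left cell is TP.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
disp = ConfusionMatrixDisplay(cm, display_labels=["papilledema", "normal"])
disp.plot(cmap="Blues")

# Annotate each quadrant with its TP/FN/FP/TN role so readers need not
# infer the axis orientation themselves.
for (i, j), role in zip([(0, 0), (0, 1), (1, 0), (1, 1)],
                        ["TP", "FN", "FP", "TN"]):
    disp.ax_.text(j, i + 0.25, role, ha="center", color="gray")
plt.show()
```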
Economic considerations and rural healthcare integration of AI are mentioned only briefly. Add a short paragraph on cost-effectiveness, the feasibility of smartphone fundus cameras, and workflow integration in rural care.
While limitations are discussed (e.g., the exclusion of pseudo-papilledema and the use of a single dataset), they are not prominently linked back to clinical practice. Reframe these limitations in terms of clinical impact (e.g., how the exclusion of pseudo-papilledema may overestimate accuracy).
Conflicts of interest of reviewers
None declared
Data and code availability
The data were not openly available; the authors note they are accessible upon request due to ethical restrictions. No source code link was provided, which reduces reproducibility.
Ethical clearance and approval
Ethical approval was reported, and the certificate number was clearly stated.
Comments by section
Title
The title, "Exploring AI's Potential in Papilledema Diagnosis to Support Dermatological Treatment Decisions in Rural Healthcare," appropriately reflects the study's scope. It is specific and highlights both the technical and clinical dimensions.
Abstract
The abstract clearly states the research question (comparing AI models and human ophthalmologists for papilledema detection in dermatology-related care). It also outlines the approach (ResNet CNN vs GPT-4o vs humans) and the key findings (ResNet outperforming both). However, the abstract could benefit from a brief contextual sentence linking papilledema more explicitly to dermatological drug risks (currently implied but not emphasized).
Introduction
The introduction summarized the research problem well: papilledema as a vision-threatening condition, its relevance in dermatology (drug-induced intracranial hypertension), and the lack of access to ophthalmologists in rural settings. The research question was also situated within AI's growing role in ophthalmology, and the authors referenced relevant, recent literature.
Materials and methods
The dataset (1,389 fundus images) was clearly described, with preprocessing steps (contrast normalization, cropping, and resizing). The training methods (ResNet fine-tuning with discriminative learning rates and a one-cycle policy) were appropriate for the limited dataset, and a GPT-4o evaluation was included for comparison.
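To make this training setup concrete, below is a minimal sketch of fine-tuning with discriminative learning rates under a one-cycle schedule, assuming a fastai-style pipeline, an ImageNet-pretrained ResNet-34, and a hypothetical two-folder data layout; the paper's exact framework, architecture depth, and hyperparameters are not stated in this review.

```python
# Minimal sketch of ResNet fine-tuning with discriminative learning rates
# and a one-cycle schedule, assuming a fastai-style pipeline; the paper's
# exact framework and hyperparameters are not given in this review.
from fastai.vision.all import (ImageDataLoaders, Resize, accuracy,
                               resnet34, vision_learner)

# Hypothetical folder layout: data/papilledema/*.jpg, data/normal/*.jpg
dls = ImageDataLoaders.from_folder(
    "data", valid_pct=0.2, seed=42, item_tfms=Resize(224))

learn = vision_learner(dls, resnet34, metrics=accuracy)  # ImageNet-pretrained

# One-cycle policy with discriminative learning rates: the earlier, more
# generic layers get a smaller maximum LR than the new classification head.
learn.unfreeze()
learn.fit_one_cycle(10, lr_max=slice(1e-5, 1e-3))
```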
Statistical methods (sensitivity, specificity, PPV, NPV, accuracy, Cohen's Kappa, and two-sample proportion tests) were appropriate and correctly reported.
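For reference, all of the reported metrics can be derived from the 2×2 confusion matrix; the following generic sketch uses scikit-learn on toy labels, not the study data.

```python
# Generic sketch: deriving the reported diagnostic metrics from a 2x2
# confusion matrix with scikit-learn (toy labels, not the study data).
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # 1 = papilledema, 0 = normal
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)               # recall for papilledema
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                       # positive predictive value
npv = tn / (tn + fn)                       # negative predictive value
acc = (tp + tn) / (tp + tn + fp + fn)
kappa = cohen_kappa_score(y_true, y_pred)  # chance-corrected agreement

print(f"Se={sensitivity:.2f} Sp={specificity:.2f} PPV={ppv:.2f} "
      f"NPV={npv:.2f} Acc={acc:.2f} Kappa={kappa:.2f}")
```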
Results were consistent across the text and tables. However, explicit "true positive/false negative" labels in the confusion matrices would improve clarity.
Discussion & conclusions
The discussion appropriately concludes that ResNet outperformed both the human experts and GPT-4o, with strong supporting evidence (accuracy and Kappa values). In addition, the authors could further expand on the cost-effectiveness and implementation challenges of AI in rural healthcare.
Competing interests
No competing interests were declared by the authors.
Competing interests
The author declares that they have no competing interests.