How does AI detect diabetic retinopathy from retinal photos? A heatmap analysis of 54 deep learning models

Timothy I. Murphy
James A. Armitage

Abstract

Purpose

To investigate how artificial intelligence (AI) systems detect referable diabetic retinopathy (DR) from retinal photographs by analysing heatmap patterns and determining their overlap with DR features.

Methods

Fifty-four AI systems were developed using 27 backbone architectures, with each implemented as both binary-referable and multi-class grading models based on the International Clinical Diabetic Retinopathy (ICDR) grading scale. Models were trained on images from DDR, BRSET and Kaggle datasets. After training, each model analysed 749 images with DR feature annotations, with Grad-CAM heatmaps generated and compared to pixel-level annotations of microaneurysms, haemorrhages, exudates, cotton wool spots, venous beading, intraretinal microvascular abnormalities and neovascularisation.
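The comparison between a heatmap and a pixel-level annotation can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `lesion_coverage`, the 0.5 activation threshold, and the definition of coverage as the fraction of annotated lesion pixels lying inside the thresholded heatmap are all assumptions made for illustration. The heatmap is assumed to be already scaled to the range 0 to 1 and resized to the annotation's resolution.

```python
import numpy as np


def lesion_coverage(heatmap, lesion_mask, threshold=0.5):
    """Fraction of annotated lesion pixels inside the thresholded heatmap.

    heatmap     : 2D array scaled to [0, 1] (e.g. a Grad-CAM map).
    lesion_mask : 2D boolean array, True where the lesion was annotated.
    threshold   : hypothetical cut-off for treating a pixel as highlighted.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    lesion_mask = np.asarray(lesion_mask, dtype=bool)
    if heatmap.shape != lesion_mask.shape:
        raise ValueError("heatmap and lesion_mask must have the same shape")

    lesion_pixels = lesion_mask.sum()
    if lesion_pixels == 0:
        # No annotated lesion of this type in the image: coverage is undefined.
        return float("nan")

    highlighted = heatmap >= threshold
    return float((highlighted & lesion_mask).sum() / lesion_pixels)


# Toy 4x4 example: four annotated pixels, two of which are highlighted.
heatmap = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.2, 0.0],
    [0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
exudate_mask = np.eye(4, dtype=bool)  # annotated pixels on the diagonal

coverage = lesion_coverage(heatmap, exudate_mask)
print(coverage)
```

Computing one such value per lesion type and per image would allow features such as exudates and venous beading to be compared on the same scale, although the choice of threshold strongly affects the result.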

Results

The models achieved acceptable predictive performance, with AUROC >0.8 for most architectures. Heatmap analysis revealed consistent attention to the macular region with relative neglect of the optic disc. Exudates and cotton wool spots were highlighted most frequently by the heatmaps, whereas venous beading and neovascularisation at the disc showed poor overall coverage for binary referable classifiers. Models grading per the ICDR scale demonstrated high coverage for all features. Substantial variability was observed between architectures, suggesting different feature detection capabilities. The heatmap analysis also indicated that the models were using different logic to the ICDR grading scale definitions.

Conclusion

AI models do not uniformly rely on all DR features when detecting referable DR, limiting their predictive performance in unusual presentations. Heatmap aggregation analysis provides a scalable method for analysing model behaviour, allowing strengths and weaknesses to be identified. These findings may help improve clinicians’ trust in and acceptance of AI.
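One simple way to picture heatmap aggregation is to average many per-image heatmaps into a single map. The sketch below is an assumption-laden illustration rather than the method used in the study: the function name `aggregate_heatmaps` and the per-image min-max normalisation are hypothetical choices, and the images are assumed to be the same size and spatially aligned so that the macula and optic disc fall in comparable positions.

```python
import numpy as np


def aggregate_heatmaps(heatmaps):
    """Average a collection of equally sized 2D heatmaps.

    Each heatmap is min-max normalised to [0, 1] first so that images with
    strong activations do not dominate the mean (an illustrative choice).
    """
    stack = np.stack([np.asarray(h, dtype=float) for h in heatmaps])
    mins = stack.min(axis=(1, 2), keepdims=True)
    maxs = stack.max(axis=(1, 2), keepdims=True)
    # Avoid division by zero for completely flat heatmaps.
    ranges = np.where(maxs > mins, maxs - mins, 1.0)
    normalised = (stack - mins) / ranges
    return normalised.mean(axis=0)


# Toy example with two 2x2 heatmaps on different scales.
maps = [
    np.array([[0.0, 2.0], [0.0, 0.0]]),
    np.array([[0.0, 0.0], [4.0, 0.0]]),
]
mean_map = aggregate_heatmaps(maps)
print(mean_map)
```

Regions that stay bright in the averaged map are those a model attends to consistently across images, which is the kind of pattern the macular and optic disc findings describe.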

Highlights

Artificial intelligence models prioritise exudates and cotton wool spots while underrepresenting critical features such as venous beading and neovascularisation.
Binary classifiers demonstrated poorer coverage of key diabetic retinopathy features compared to multi-class classifiers.
Heatmap aggregation enables scalable identification of AI model strengths and weaknesses.

Version published to 10.64898/2026.06.19.26356040 on medRxiv
Jun 22, 2026

