How does AI detect diabetic retinopathy from retinal photos? A heatmap analysis of 54 deep learning models
Abstract
Purpose
To investigate how artificial intelligence (AI) systems detect referable diabetic retinopathy (DR) from retinal photographs by analysing heatmap patterns and determining their overlap with DR features.
Methods
Fifty-four AI systems were developed using 27 backbone architectures, with each implemented as both binary-referable and multi-class grading models based on the International Clinical Diabetic Retinopathy (ICDR) grading scale. Models were trained on images from DDR, BRSET and Kaggle datasets. After training, each model analysed 749 images with DR feature annotations, with Grad-CAM heatmaps generated and compared to pixel-level annotations of microaneurysms, haemorrhages, exudates, cotton wool spots, venous beading, intraretinal microvascular abnormalities and neovascularisation.
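The comparison between a heatmap and a pixel-level annotation can be illustrated with a minimal sketch. The abstract does not state how overlap was quantified, so everything here is an assumption made for illustration: the function name `lesion_coverage`, the min-max normalisation, the 0.5 threshold, and the definition of coverage as the fraction of annotated lesion pixels that fall inside the thresholded heatmap.

```python
import numpy as np


def lesion_coverage(heatmap, lesion_mask, threshold=0.5):
    """Fraction of annotated lesion pixels inside the thresholded heatmap.

    Hypothetical metric for illustration only; the threshold and the
    normalisation are assumptions, not the study's stated method.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    lesion_mask = np.asarray(lesion_mask, dtype=bool)
    if heatmap.shape != lesion_mask.shape:
        raise ValueError("heatmap and lesion_mask must have the same shape")

    n_lesion = int(lesion_mask.sum())
    if n_lesion == 0:
        # No annotated pixels for this feature: coverage is undefined.
        return float("nan")

    # Min-max normalise so the threshold is comparable across images.
    span = heatmap.max() - heatmap.min()
    if span > 0:
        norm = (heatmap - heatmap.min()) / span
    else:
        norm = np.zeros_like(heatmap)

    hot = norm >= threshold
    return float((hot & lesion_mask).sum() / n_lesion)


# Toy 4x4 example: the heatmap highlights the upper-left quadrant,
# and two of the four annotated pixels lie inside it.
heatmap = np.zeros((4, 4))
heatmap[:2, :2] = 1.0
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = mask[0, 1] = True
mask[3, 2] = mask[3, 3] = True
coverage = lesion_coverage(heatmap, mask)
```

In this toy case half of the annotated pixels are covered. A real analysis would compute such a value per image and per DR feature, with the heatmap resized to the annotation resolution beforehand.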
Results
All models achieved acceptable predictive performance (AUROC >0.8 for most architectures). Heatmap analysis revealed consistent attention to the macular region with relative neglect of the optic disc. Exudates and cotton wool spots were highlighted most frequently by the heatmaps, with venous beading and neovascularisation at the disc showing poor overall coverage for binary referable classifiers. Models grading per the ICDR scale demonstrated high coverage for all features. Substantial variability was observed between architectures, suggesting different feature detection capabilities. Interestingly, the heatmap analysis indicated that the models were using different logic to the ICDR grading scale definitions.
Conclusion
AI models do not uniformly rely on all DR features when detecting referable DR, limiting their predictive performance in unusual presentations. Heatmap aggregation analysis provides a scalable method for analysing model behaviour, allowing strengths and weaknesses to be identified. These findings may help improve clinicians’ trust in and acceptance of AI.
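One way to picture heatmap aggregation is as a roll-up of per-image coverage scores into a per-model, per-feature summary. The sketch below assumes a simple mean as the aggregate; the function name `aggregate_coverage`, the record layout, the model and feature labels, and the numbers are all hypothetical and do not come from the study.

```python
from collections import defaultdict
from statistics import mean


def aggregate_coverage(records):
    """Average per-image coverage for each (model, feature) pair.

    `records` is an iterable of (model, feature, coverage) tuples.
    NaN coverages (feature absent from the image) are skipped.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for model, feature, coverage in records:
        if coverage == coverage:  # NaN is the only value not equal to itself
            grouped[model][feature].append(coverage)
    return {
        model: {feature: mean(values) for feature, values in features.items()}
        for model, features in grouped.items()
    }


# Illustrative values only, not results from the paper.
records = [
    ("model_a", "exudates", 0.75),
    ("model_a", "exudates", 0.25),
    ("model_a", "venous_beading", 0.0),
    ("model_a", "venous_beading", 0.25),
    ("model_b", "exudates", 1.0),
    ("model_b", "venous_beading", float("nan")),
]
summary = aggregate_coverage(records)
```

A table built this way lets features with consistently low coverage stand out for a given architecture without inspecting heatmaps one image at a time.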
Highlights
- Artificial intelligence models prioritise exudates and cotton wool spots while underrepresenting critical features such as venous beading and neovascularisation.
- Binary classifiers demonstrated poorer coverage of key diabetic retinopathy features compared to multi-class classifiers.
- Heatmap aggregation enables scalable identification of AI model strengths and weaknesses.