Out-of-Distribution Performance Analysis of Skin Lesion Classifiers for Dermoscopic Images
Abstract
Background: Public skin lesion image datasets have enabled rapid progress in classification tasks. However, models trained on data with similar characteristics, known as in-distribution (ID) data, often struggle to generalize to new and different data, which limits their utility in clinical settings. New methods are thus needed to assess algorithm performance and trustworthiness on out-of-distribution (OOD) data.

Objective: This study evaluates the generalization capacity and robustness of deep learning models for the binary classification (malignant vs. non-malignant) of skin lesions by assessing their performance and predictive confidence in OOD settings.

Methods: Four convolutional neural networks (CNNs), namely AlexNet, VGG, ResNet, and DenseNet, are trained on public datasets, which serve as the ID group. Their performance and reliability are then evaluated under distribution shift by testing them on private datasets, which are treated as OOD cohorts.

Results: The VGG model achieves the best overall performance on the ID test set (AUROC = 0.895) and maintains balanced performance across the OOD datasets. However, domain shift analysis reveals marked performance drops in specific domains, particularly those with strong distributional shifts in age and diagnosis.

Conclusions: The results underscore the need for domain-aware evaluation and for models trained on more diverse and representative datasets to ensure generalization across clinically relevant populations.
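The ID versus OOD comparison described in the abstract can be illustrated with a minimal sketch that computes AUROC separately for each cohort and reports the gap. This is not the authors' code: the cohort names, the record layout, and the toy scores below are assumptions made only for illustration, and AUROC is computed here with the rank-based (Mann-Whitney) formulation in plain Python rather than with any particular library.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative (Mann-Whitney formulation); ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_by_cohort(records):
    """Group (cohort, label, score) records by cohort and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for cohort, label, score in records:
        grouped[cohort][0].append(label)
        grouped[cohort][1].append(score)
    return {c: auroc(ys, ss) for c, (ys, ss) in grouped.items()}


# Hypothetical toy data: label 1 = malignant, 0 = non-malignant;
# score = the model's predicted probability of malignancy.
records = [
    ("ID", 1, 0.9), ("ID", 0, 0.2), ("ID", 1, 0.8), ("ID", 0, 0.4),
    ("OOD", 1, 0.6), ("OOD", 0, 0.7), ("OOD", 1, 0.9), ("OOD", 0, 0.3),
]

results = auroc_by_cohort(records)
drop = results["ID"] - results["OOD"]  # positive value = worse on OOD
print(results, drop)
```

The same grouping could be applied to finer domains (for example age bands or diagnosis categories, the two shifts the abstract singles out) by changing the cohort key, provided each group contains both classes.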