Clinical Application of Vision Transformers for Melanoma Classification: A Multi-Dataset Evaluation Study
Abstract
Background: Melanoma is among the most lethal skin cancers, and survival depends heavily on early detection, yet diagnosis remains challenging because melanomas often resemble benign nevi. Despite their success in automated dermoscopy, convolutional neural networks (CNNs) are limited by their focus on local features and their dependence on fixed input sizes, which can constrain generalization. Vision Transformers (ViTs), which model global image context through self-attention, offer a promising alternative.

Methods: A ViT-L/16 model was fine-tuned on the ISIC 2019 dataset of over 25,000 dermoscopic images. To expand the dataset and balance class representation, synthetic nevus and melanoma images were generated with StyleGAN2-ADA, and only high-confidence outputs were retained. Performance was assessed on an external biopsy-confirmed dataset (MN187) and compared against CNN baselines (ResNet-152, DenseNet-201, EfficientNet-B7, ConvNeXt-XL), a smaller ViT-B/16 model, and the commercial MoleAnalyzer Pro system using ROC-AUC and DeLong's test.

Results: The ViT-L/16 model achieved the highest baseline ROC-AUC on MN187 (0.902), exceeding the CNN models and MoleAnalyzer Pro, although this difference was not statistically significant (p = 0.07). Adding 46,000 confidence-filtered GAN-generated images raised the ROC-AUC to 0.915, a statistically significant improvement (p = 0.032).

Conclusions: Vision Transformers show strong potential for melanoma classification, especially when combined with GAN-based augmentation; their global feature representation and capacity for data expansion support the development of reliable AI-driven clinical decision-support systems.
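The fine-tuning step lends itself to a short illustration. The sketch below loads an ImageNet-pretrained ViT-L/16 through the `timm` library and swaps in a two-class head for nevus-versus-melanoma classification; the library choice, optimizer, and learning rate are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal fine-tuning sketch, assuming the `timm` library and a binary
# nevus (0) vs. melanoma (1) label setup; hyperparameters are placeholders.
import timm
import torch

# ImageNet-pretrained ViT-L/16 with a fresh 2-class classification head.
model = timm.create_model("vit_large_patch16_224", pretrained=True, num_classes=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
criterion = torch.nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient step on a batch of 224x224 dermoscopic images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```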
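The confidence-filtering step can be sketched similarly: an auxiliary classifier scores each StyleGAN2-ADA output, and a synthetic image is kept only if the predicted probability for its intended class passes a threshold. The classifier, the 0.9 threshold, and the function names here are assumptions for illustration; the abstract does not specify them.

```python
# Sketch of confidence filtering for GAN-generated images, assuming an
# auxiliary classifier and a placeholder threshold of 0.9.
import torch

@torch.no_grad()
def filter_synthetic(images, intended_labels, classifier, threshold=0.9):
    """Return indices of synthetic images whose classifier confidence for
    the intended class (0 = nevus, 1 = melanoma) meets the threshold."""
    classifier.eval()
    probs = torch.softmax(classifier(images), dim=1)
    conf = probs[torch.arange(len(images)), intended_labels]
    return (conf >= threshold).nonzero(as_tuple=True)[0]
```

Only the retained images would then be merged with the real ISIC 2019 training data before fine-tuning.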
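Finally, a self-contained sketch of DeLong's test for comparing two correlated ROC-AUCs on the same test set, as used here to compare ViT-L/16 against each baseline. It follows the standard placement-value formulation of DeLong et al.; the variable names and the 0/1 label convention are assumptions.

```python
# DeLong's test for two models scored on the same cases (1 = melanoma).
import numpy as np
from scipy import stats

def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for the difference between two correlated
    ROC-AUCs; returns (auc_a, auc_b, p_value)."""
    y_true = np.asarray(y_true)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    pos_a, neg_a = scores_a[y_true == 1], scores_a[y_true == 0]
    pos_b, neg_b = scores_b[y_true == 1], scores_b[y_true == 0]
    m, n = len(pos_a), len(neg_a)

    def placements(pos, neg):
        # psi(x, y) = 1 if x > y, 0.5 if x == y, 0 otherwise
        psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
        return psi.mean(axis=1), psi.mean(axis=0)  # per-positive V10, per-negative V01

    v10_a, v01_a = placements(pos_a, neg_a)
    v10_b, v01_b = placements(pos_b, neg_b)
    auc_a, auc_b = v10_a.mean(), v10_b.mean()
    s10, s01 = np.cov([v10_a, v10_b]), np.cov([v01_a, v01_b])
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (auc_a - auc_b) / np.sqrt(var)
    return auc_a, auc_b, 2 * stats.norm.sf(abs(z))
```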