Improving the Quality of Skin Lesion Data for Training Vision-Language Models
Abstract
Skin cancer diagnosis using machine learning faces significant challenges, primarily due to the lack of well-labelled and balanced skin lesion datasets. Most available datasets are limited to only two lesion types, melanoma and nevi, or exhibit severe class and skin tone imbalance, leading to biases during model training. Vision-language models (VLMs) face an additional challenge: these datasets also lack the semantic labelling needed for effective training. To address these challenges, several researchers have used generative adversarial networks (GANs) to generate realistic synthetic images. While this can improve diagnostic accuracy, it raises ethical and trust concerns, especially in clinical settings. Moreover, applying GANs to imbalanced datasets amplifies the existing biases. This paper proposes an alternative approach: curating and combining the existing public datasets HAM10000 and BCN20000 into a single well-labelled dataset called RHB, optimised for training Google’s Gemma 3 4B model.