Resource-efficient medical vision language model for dermatology via a synthetic data generation framework
Abstract
Vision-language models (VLMs), with their ability to integrate visual and textual information, have enabled unified and interpretable multimodal reasoning. However, developing explainable, image-based artificial intelligence (AI) systems for medicine requires locally deployable models that support privacy-preserving data workflows. Here, we present SCALEMED (Scalable Clinical Assistants and LEarning for MEDicine), a modular framework that enables the development of locally deployable medical VLMs using small models and synthetic data. The SCALEMED framework integrates clinician data annotation, open-source image-text data collection, synthetic data generation through knowledge transfer from larger VLMs, and fine-tuning of small VLMs to develop domain-specific medical AI systems. As a use case in dermatology, we train a resource-efficient VLM, DermatoLlama, which achieves higher success rates in report generation than state-of-the-art VLMs across text- and image-based evaluation datasets. DermatoLlama, based on Llama 3.2, is trained on DermaSynth, a dataset comprising 1.2 million synthetic text samples generated from 367 expert-crafted seed tasks and 82,379 open-source dermatological images. The SCALEMED framework offers a practical solution for developing explainable and accessible medical AI systems, particularly in resource-constrained healthcare environments.
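To make the synthetic-data step of the pipeline concrete, the sketch below illustrates how a larger "teacher" VLM might be prompted with expert-crafted seed tasks over a directory of dermatological images to produce image-text instruction pairs for fine-tuning a small VLM. This is a minimal illustration under our own assumptions, not the authors' released code: the function query_teacher_vlm, the files seed_tasks.json and dermasynth_sample.jsonl, and the images/ directory are all hypothetical placeholders.

```python
"""
Minimal sketch of a SCALEMED-style synthetic-data generation loop.
A larger teacher VLM answers expert-written seed tasks about each image,
and the answers are stored as instruction-tuning records (JSONL).
All file names and the teacher call are illustrative placeholders.
"""
import json
from pathlib import Path


def query_teacher_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a larger vision-language model
    (hosted API or locally served). Replace with a real client call."""
    return f"[teacher answer to '{prompt}' for {Path(image_path).name}]"


def build_synthetic_dataset(image_dir: str, seed_task_file: str, out_file: str) -> None:
    # Expert-crafted seed tasks, e.g. [{"instruction": "Describe the lesion ..."}, ...]
    seed_tasks = json.loads(Path(seed_task_file).read_text())

    with open(out_file, "w") as f:
        for image_path in sorted(Path(image_dir).glob("*.jpg")):
            for task in seed_tasks:
                # Knowledge transfer: the teacher VLM generates the response text.
                answer = query_teacher_vlm(str(image_path), task["instruction"])
                record = {
                    "image": str(image_path),
                    "instruction": task["instruction"],
                    "response": answer,
                }
                f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    build_synthetic_dataset("images/", "seed_tasks.json", "dermasynth_sample.jsonl")
```

The resulting JSONL records (image path, instruction, teacher response) are the kind of paired image-text samples that could then be used to fine-tune a small, locally deployable VLM such as a Llama 3.2-based model.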