Resource-efficient medical vision language model for dermatology via a synthetic data generation framework

Abstract

Vision-language models (VLMs), with their ability to integrate visual and textual information, have enabled unified and interpretable multimodal reasoning. However, developing explainable, image-based artificial intelligence (AI) systems for medicine requires locally deployable models designed to ensure privacy-preserving data workflows. Here, we present SCALEMED (Scalable Clinical Assistants and LEarning for MEDicine), a modular framework that enables the development of locally deployable medical VLMs using small models and synthetic data. The SCALEMED framework integrates clinician data annotation, open-source image-text data collection, synthetic data generation through knowledge transfer using larger VLMs, and fine-tuning of small VLMs to develop domain-specific medical AI systems. As a use case in dermatology, we train a resource-efficient VLM, DermatoLlama, which demonstrates higher success rates in report generation than state-of-the-art VLMs across text- and image-based evaluation datasets. DermatoLlama, based on Llama 3.2, was trained using DermaSynth, a dataset comprising 1.2 million synthetic text samples generated from 367 expert-crafted seed tasks and 82,379 open-source dermatological images. The SCALEMED framework offers a practical solution for developing explainable and accessible medical AI systems, particularly in resource-constrained healthcare environments.
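The abstract describes the knowledge-transfer step only at a high level. The following is a minimal, illustrative Python sketch of how expert-crafted seed tasks and open-source images might be paired and sent to a larger "teacher" VLM to produce synthetic instruction-tuning samples; the names query_teacher_vlm and SEED_TASKS, the JSON field layout, and the file paths are assumptions for illustration, not the paper's actual implementation.

# Illustrative sketch: pair seed tasks with open-source dermatology images,
# query a larger "teacher" VLM, and store the responses as synthetic
# instruction-tuning samples. Names and formats are hypothetical.

import json
from pathlib import Path

# Hypothetical seed tasks (the paper reports 367 expert-crafted ones).
SEED_TASKS = [
    "Describe the primary lesion morphology visible in this image.",
    "Write a brief dermatological report for this image.",
]


def query_teacher_vlm(image_path: Path, instruction: str) -> str:
    """Placeholder for a call to a larger VLM (via an API or a local model).

    Replace this stub with a real image-plus-text generation call; it returns
    a canned string here so the sketch runs end to end.
    """
    return f"[synthetic response for {image_path.name}: {instruction}]"


def build_synthetic_dataset(image_dir: Path, out_file: Path) -> None:
    """Create one instruction/response pair per image and seed task."""
    image_paths = sorted(image_dir.glob("*.jpg")) if image_dir.is_dir() else []
    samples = []
    for image_path in image_paths:
        for instruction in SEED_TASKS:
            samples.append(
                {
                    "image": str(image_path),
                    "instruction": instruction,
                    "response": query_teacher_vlm(image_path, instruction),
                }
            )
    out_file.write_text(json.dumps(samples, indent=2))


if __name__ == "__main__":
    build_synthetic_dataset(Path("dermatology_images"), Path("dermasynth_samples.json"))

The resulting JSON file of image/instruction/response triples is the kind of synthetic corpus that could then be used to fine-tune a small VLM such as Llama 3.2; the fine-tuning step itself is not shown here.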
