A Retrospective Analysis of a Dermatology-Trained Multimodal Large Language Model's Diagnostic Accuracy in Pigmented Skin Lesions
Abstract
Background: Artificial intelligence (AI) has shown significant promise in augmenting diagnostic capabilities across medical specialties. Recent advances in generative AI allow models to synthesize and interpret complex clinical data, including imaging and patient history, to assess disease risk.

Objective: To evaluate the diagnostic performance of a dermatology-trained multimodal large language model (DermFlow, Delaware, USA) in assessing the malignancy risk of pigmented skin lesions.

Methods: This retrospective study used data from 59 patients with 68 biopsy-proven pigmented skin lesions seen at Indiana University clinics from February 2023 to May 2025. De-identified patient histories and clinical images were input into DermFlow, while clinical images alone were input into Claude Sonnet 4 (Claude), and each model generated a differential diagnosis. Clinician pre-operative diagnoses were extracted from the clinical note. All assessments were compared with the histopathologic diagnosis (gold standard).

Results: Among 68 clinically concerning pigmented lesions, DermFlow achieved 47.1% top-diagnosis accuracy and 92.6% any-diagnosis accuracy, where any-diagnosis accuracy counts a case as correct when the correct diagnosis appears anywhere within the limited differential. Claude performed significantly worse, with 8.8% top-diagnosis accuracy and 73.5% any-diagnosis accuracy. Clinicians achieved 38.2% top-diagnosis accuracy and 72.1% any-diagnosis accuracy. DermFlow recommended biopsy in 95.6% of cases, compared with 82.4% for Claude. Statistical analysis revealed several significant differences between DermFlow and both comparators (p < 0.05).

Conclusions: DermFlow demonstrated comparable or superior diagnostic performance to clinicians and superior performance to Claude in evaluating pigmented skin lesions. Although additional data must be gathered to validate the model in real clinical settings, these initial findings suggest potential utility for dermatology-trained AI models in clinical practice, particularly in settings with limited dermatologist availability.
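
The two accuracy measures reported in the Results can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names, the exact-string matching of diagnoses, and the toy data are assumptions made for illustration only. The helper `implied_count` back-calculates an approximate lesion count from a reported percentage and n = 68; those counts are derived here and are not stated in the abstract.

```python
from typing import Sequence


def top_diagnosis_accuracy(
    differentials: Sequence[Sequence[str]], truths: Sequence[str]
) -> float:
    """Fraction of lesions whose first-listed diagnosis matches histopathology."""
    hits = sum(1 for d, t in zip(differentials, truths) if d and d[0] == t)
    return hits / len(truths)


def any_diagnosis_accuracy(
    differentials: Sequence[Sequence[str]], truths: Sequence[str]
) -> float:
    """Fraction of lesions whose differential contains the histopathologic diagnosis."""
    hits = sum(1 for d, t in zip(differentials, truths) if t in d)
    return hits / len(truths)


def implied_count(percent: float, n: int) -> int:
    """Nearest whole number of cases consistent with a reported percentage."""
    return round(percent / 100 * n)


# Hypothetical toy data (NOT from the study): one differential per lesion,
# ordered from most to least likely, plus the biopsy result for each lesion.
toy_differentials = [
    ["melanoma", "dysplastic nevus"],
    ["seborrheic keratosis", "melanoma"],
    ["dysplastic nevus", "benign nevus"],
    ["benign nevus"],
]
toy_truths = ["melanoma", "melanoma", "lentigo", "benign nevus"]

print(top_diagnosis_accuracy(toy_differentials, toy_truths))  # 0.5
print(any_diagnosis_accuracy(toy_differentials, toy_truths))  # 0.75

# Counts implied by the reported DermFlow percentages with 68 lesions
# (derived by rounding, not reported in the abstract).
print(implied_count(47.1, 68))  # 32
print(implied_count(92.6, 68))  # 63
```

Exact string matching is the simplest possible scoring rule. A real evaluation would need an agreed mapping between model output and histopathologic categories, which the abstract does not describe.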