Artificial Intelligence Meets Real-Life Dermatology: Diagnostic Accuracy Assessment in a Retrospective Case Series
Abstract
Background: Although artificial intelligence (AI) has shown considerable promise in dermatological diagnostics, its real-world clinical validation remains limited. This study aimed to evaluate the diagnostic accuracy and clinical decision-support capabilities of GPT-4.5 in a routine outpatient dermatology setting.

Methods: A total of 402 dermatologic cases from 400 patients were retrospectively analyzed at a secondary-care dermatology clinic. GPT-4.5 was provided with clinical and dermoscopic images, along with brief metadata (e.g., age, lesion location, duration), and asked to generate differential diagnoses and management suggestions. Model outputs were compared with dermatologist assessments. Performance metrics included diagnostic accuracy, sensitivity, specificity, precision, and F1 score. Misclassification patterns were also reviewed.

Results: GPT-4.5 achieved an overall diagnostic accuracy of 89.3% and correctly identified the primary diagnosis as its top-ranked suggestion in 71.9% of cases. Sensitivity and specificity were 89.7% and 91.4%, respectively, with an F1 score of 94.3%. Clinical guidance recommendations were concordant with physician decisions in 91.0% of cases. Diagnostic accuracy was higher in non-biopsied cases (96.0%) than in cases requiring histopathological confirmation (84.2%). The highest performance was observed in infectious (94.3%) and inflammatory (96.2%) dermatoses. Misclassifications were most common in pigmented neoplasms and morphologically similar inflammatory disorders.

Conclusion: GPT-4.5 demonstrated high diagnostic accuracy and strong alignment with clinical decisions in outpatient dermatology, especially for common and visually distinct conditions. However, performance declined in diagnostically complex or ambiguous cases. These findings support its potential as a supplementary clinical tool, while underscoring the need for multimodal inputs, physician oversight, and broader prospective validation prior to clinical integration.
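The abstract names its performance metrics without defining them. As a minimal sketch, the Python below shows the standard definitions of accuracy, sensitivity, specificity, precision, and F1 score from binary confusion counts, plus a top-k accuracy helper that illustrates how an "anywhere in the differential" figure can differ from a "top-ranked suggestion" figure. The names `ConfusionCounts`, `binary_metrics`, and `top_k_accuracy` are hypothetical, and the counts and diagnoses in the usage lines are invented for illustration and are not study data. The abstract does not say how the authors defined positives and negatives or how they aggregated metrics across diagnostic classes, so this should not be read as their method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion counts (hypothetical container, not from the study)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Standard textbook definitions of the metrics named in the abstract."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # also called recall
    precision = _ratio(c.tp, c.tp + c.fp)    # positive predictive value
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "sensitivity": sensitivity,
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "precision": precision,
        # F1 is the harmonic mean of precision and sensitivity.
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def top_k_accuracy(ranked: list[list[str]], reference: list[str], k: int) -> float:
    """Fraction of cases whose reference diagnosis is among the first k suggestions."""
    hits = sum(ref in preds[:k] for preds, ref in zip(ranked, reference))
    return _ratio(hits, len(reference))


# Illustrative values only; these are NOT data from the study.
example_counts = ConfusionCounts(tp=8, fp=2, tn=9, fn=1)
example_metrics = binary_metrics(example_counts)

example_ranked = [
    ["psoriasis", "eczema"],
    ["nevus", "melanoma"],
    ["tinea", "eczema"],
]
example_reference = ["psoriasis", "melanoma", "psoriasis"]
top1 = top_k_accuracy(example_ranked, example_reference, k=1)
top2 = top_k_accuracy(example_ranked, example_reference, k=2)
```

In this toy example the second case counts as correct for `k=2` but not for `k=1`, which is the same kind of gap the abstract reports between overall diagnostic accuracy and top-ranked accuracy.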