Performance of Vision–Language Models Compared with 252 Medical Students on Text-only and Image-based Dermatology Examinations

Abstract

Vision–language models (VLMs) are increasingly evaluated in medical education, yet their performance on visually intensive assessments remains incompletely understood. We compared four state-of-the-art VLMs (ChatGPT-4o, ChatGPT-5, Gemini 2.5 Flash, and Gemini 3 Pro) with fifth-year medical students on ten consecutive dermatology clerkship examinations administered between September 2023 and January 2025. Examinations combined text-only questions (multiple-choice, multiple-select, and matching; 60% of total score) with image-based, structured open-ended questions (40%) spanning seven dermatologic sub-domains. Model outputs were evaluated using expert-validated answer keys and grading rubrics, with repeated runs to assess output variability. All VLMs significantly outperformed medical students on text-only examinations (mean scores > 95 vs. 84.9; p < 0.001), showing minimal sensitivity to exam difficulty. In contrast, image-based performance was heterogeneous: Gemini 3 Pro and ChatGPT-5 achieved higher scores than students, whereas students significantly outperformed ChatGPT-4o and Gemini 2.5 Flash. Medical students demonstrated the smallest performance gap between text-only and image-based components, indicating greater cross-modal consistency. Sub-domain analyses revealed that some models achieved accurate visual description and diagnosis but showed reduced performance in etiological and treatment reasoning. Gemini 3 Pro exhibited the highest overall accuracy and the lowest output variability across repeated evaluations. These findings indicate that while VLMs excel in text-based dermatologic assessment, multimodal competence remains uneven and model-dependent, supporting their use as complementary rather than standalone tools in dermatology education.