Performance of Vision–Language Models Compared with 252 Medical Students on Text-only and Image-based Dermatology Examinations

Abstract

Vision–language models (VLMs) are increasingly evaluated in medical education, yet their performance on visually intensive assessments remains incompletely understood. We compared four state-of-the-art VLMs (ChatGPT-4o, ChatGPT-5, Gemini 2.5 Flash, and Gemini 3 Pro) with fifth-year medical students on ten consecutive dermatology clerkship examinations administered between September 2023 and January 2025. Examinations combined text-only questions (multiple-choice, multiple-select, and matching; 60% of total score) with image-based, structured open-ended questions (40%) spanning seven dermatologic sub-domains. Model outputs were evaluated using expert-validated answer keys and grading rubrics, with repeated runs to assess output variability. All VLMs significantly outperformed medical students on text-only examinations (mean scores > 95 vs. 84.9; p < 0.001), showing minimal sensitivity to exam difficulty. In contrast, image-based performance was heterogeneous: Gemini 3 Pro and ChatGPT-5 achieved higher scores than students, whereas students significantly outperformed ChatGPT-4o and Gemini 2.5 Flash. Medical students demonstrated the smallest performance gap between text-only and image-based components, indicating greater cross-modal consistency. Sub-domain analyses revealed that some models achieved accurate visual description and diagnosis but showed reduced performance in etiological and treatment reasoning. Gemini 3 Pro exhibited the highest overall accuracy and the lowest output variability across repeated evaluations. These findings indicate that while VLMs excel in text-based dermatologic assessment, multimodal competence remains uneven and model-dependent, supporting their use as complementary rather than standalone tools in dermatology education.