AI Performance on Image-based Medical Case Scenarios: A Cross-Sectional Comparative Study
Abstract
Background: Large language models (LLMs) have shown remarkable progress on text-based tasks, but their ability to interpret and respond to image-based clinical scenarios remains underexplored. This study evaluated and compared the performance of ChatGPT-5 and Claude in answering subjective, image-based medical case questions.

Methods: A cross-sectional comparative study was conducted using 71 subjective, image-based dermatological case questions designed by the research team. Each AI system generated responses to identical visual and textual inputs without external assistance. Two experienced dermatologists, blinded to model identity, independently scored the responses against standard answers. Inter-rater reliability was assessed using intraclass correlation coefficients (ICC), and comparative analyses employed Mann–Whitney U tests, Bland–Altman plots, and correlation metrics.

Results: Both evaluators demonstrated excellent inter-rater reliability (ICC > 0.86). Claude achieved higher mean scores (27.39 ± 11.44) than ChatGPT-5 (25.53 ± 11.45; p < 0.001). Claude also showed a stronger correlation with the reference standards (ρ = 0.88 vs. 0.83), a lower mean absolute error (14.76% vs. 19.98%), and a reduced root mean square error (7.24 vs. 9.24). Bland–Altman analysis revealed minimal systematic bias between evaluators, indicating consistent scoring reliability.

Conclusions: Both multimodal LLMs demonstrated strong competence in interpreting image-based medical scenarios. Claude exhibited a modest but consistent advantage in diagnostic reasoning and clinical alignment. These findings support the potential of LLMs as supplementary educational tools in visual disciplines such as dermatology and underscore the importance of model selection, supervised use, and continued evaluation as AI integration in medical education expands.
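The following is a minimal sketch, not the authors' analysis code, of the statistical pipeline the Methods describe: ICC for inter-rater reliability, a Mann–Whitney U test for the model comparison, Spearman correlation against the reference standard, Bland–Altman bias with limits of agreement, and MAE/RMSE. The synthetic data and all variable names (reference, claude, gpt5, rater_a, rater_b) are hypothetical placeholders, assumed only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the study data: 71 cases per model.
reference = rng.uniform(10, 45, size=71)            # standard answers
claude = reference + rng.normal(0, 4, size=71)      # Claude scores
gpt5 = reference + rng.normal(-2, 5, size=71)       # ChatGPT-5 scores


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects, k_raters) array (Shrout & Fleiss form).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ms_r = ss_rows / (n - 1)                                  # between-subjects
    ms_c = ss_cols / (k - 1)                                  # between-raters
    ms_e = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)


# Inter-rater reliability: two blinded raters scoring the same responses.
rater_a = claude + rng.normal(0, 2, size=71)
rater_b = claude + rng.normal(0, 2, size=71)
print("ICC(2,1):", round(icc_2_1(np.column_stack([rater_a, rater_b])), 3))

# Model comparison on the score distributions.
u, p = stats.mannwhitneyu(claude, gpt5, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p:.4f}")

# Agreement with the reference standard.
rho, _ = stats.spearmanr(claude, reference)
print(f"Spearman rho vs. reference: {rho:.2f}")

# Bland-Altman bias and 95% limits of agreement between the raters.
diff = rater_a - rater_b
bias, sd = diff.mean(), diff.std(ddof=1)
print(f"bias = {bias:.2f}, LoA = [{bias - 1.96*sd:.2f}, {bias + 1.96*sd:.2f}]")

# Error metrics against the reference standard.
for name, scores in [("Claude", claude), ("ChatGPT-5", gpt5)]:
    err = scores - reference
    print(f"{name}: MAE = {np.abs(err).mean():.2f}, "
          f"RMSE = {np.sqrt((err ** 2).mean()):.2f}")
```

Note that because both models answered the same 71 cases, a paired test such as the Wilcoxon signed-rank test would also be a reasonable choice; the sketch uses Mann–Whitney U to mirror the method named in the abstract.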