Large Language Models in Radiology Exams: A Comparative Analysis of Performance in Turkish and English


Abstract

Background: The primary objective of this study was to evaluate the performance of large language models (LLMs) on radiology questions and to analyze performance differences between Turkish and English. In addition, the consistency of the models' responses to the same questions at different time points was examined, and the results were compared with the performance of radiology residents.

Materials and Methods: This study evaluated the performance of ChatGPT-5, Grok-4, Claude 4.5 Sonnet, and Gemini 2.5 Pro using 100 multiple-choice radiology questions across five subspecialties. To assess the impact of language, ChatGPT-5 and Gemini 2.5 Pro were tested in both Turkish and English. Temporal reliability was examined by re-testing ChatGPT-5, Claude 4.5 Sonnet, and Grok-4 after a one-week interval. Finally, model outputs were benchmarked against a control group of 18 radiology residents (1–3 years of seniority).

Results: Gemini 2.5 Pro achieved the highest accuracy (89%), followed by Claude 4.5 Sonnet (86%), ChatGPT-5 (85%), and Grok-4 (84%). All LLMs and 3rd-year residents (75.8%) significantly outperformed 1st-year (58.7%) and 2nd-year (66%) residents. Subspecialty analysis showed that 3rd-year residents excelled in musculoskeletal radiology, while Claude 4.5 Sonnet and Gemini 2.5 Pro significantly surpassed 1st-year residents in abdominal radiology. No significant performance gap was found between Turkish and English outputs for ChatGPT-5 and Gemini 2.5 Pro (p = 1.000), indicating good linguistic agreement (κ ≈ 0.73). Regarding temporal reliability, Claude 4.5 Sonnet demonstrated "very good" consistency over one week (κ = 0.872), whereas Grok-4 (κ = 0.575) and ChatGPT-5 (κ = 0.559) showed only "moderate" reliability.

Conclusion: Our findings demonstrate that high-performing LLMs such as Gemini 2.5 Pro, ChatGPT-5, and Grok-4 provide fundamental radiology knowledge with high accuracy and comparable efficiency. These models show considerable potential as supportive tools for radiology education. However, further research incorporating image-based datasets is needed to determine their clinical value in real-world radiological practice.
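Note on the agreement statistics: the κ values cited for linguistic agreement and temporal reliability are, judging from the notation and the "moderate"/"very good" qualitative bands, presumably Cohen's kappa, which corrects the observed proportion of identical responses for agreement expected by chance. A minimal sketch of the computation, assuming a question-by-question comparison of two response sets (e.g., the same model at two time points), is:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the observed proportion of questions answered identically and \(p_e\) is the proportion expected by chance from the marginal answer frequencies. As an illustrative (hypothetical) calculation, if two sessions give the same answer on 90 of 100 questions (\(p_o = 0.90\)) and chance agreement is \(p_e = 0.70\), then \(\kappa = (0.90 - 0.70)/(1 - 0.70) \approx 0.67\).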
