Diagnostic Performance of Large Language Models and Radiologists in Case-Based Radiology Questions


Abstract

Background: Large language models (LLMs) have demonstrated considerable potential in supporting medical decision-making. Until recently, LLMs were restricted to text inputs, limiting their utility in image interpretation. The introduction of ChatGPT-4V, which can analyze visual data, has opened new opportunities to evaluate LLM performance in radiological image interpretation. This study investigates the performance of ChatGPT-4V in radiological image interpretation compared to two board-certified radiologists. The secondary aim is to compare the accuracy of primary and differential diagnoses provided by three different LLMs.

Materials and Methods: A total of 121 radiology cases were retrospectively retrieved from the Association of Academic Radiology “Case of the Month” archive. Each case consisted of the following sections: patient presentation, unlabeled findings, labeled findings, final diagnosis, case discussion, and references. Three LLMs (ChatGPT-3.5, ChatGPT-4o, and Google Gemini 1.5 Pro) were provided with the patient presentation and labeled findings sections. ChatGPT-4V and two board-certified radiologists were provided with the patient presentation and unlabeled findings sections, including radiological images. All comparisons were conducted separately for image-based evaluations (Radiologist 1, Radiologist 2, and GPT-4V) and text-based evaluations (GPT-4o, GPT-3.5, and Gemini 1.5 Pro). Categorical variables indicating correct and incorrect responses were summarized as frequencies and percentages. Primary diagnosis and differential diagnosis accuracy were analyzed with McNemar’s test and Cochran’s Q test.

Results: Both Radiologist 1 (72.5%) and Radiologist 2 (71.1%) significantly outperformed ChatGPT-4V (38.8%) (p < 0.001). ChatGPT-3.5 achieved the highest primary diagnostic accuracy (80.9%), followed by ChatGPT-4o (78.5%) and Gemini 1.5 Pro (72.7%). For differential diagnoses, ChatGPT-4o achieved the highest accuracy (90.9%), slightly outperforming ChatGPT-3.5 (90.1%) and significantly exceeding Gemini 1.5 Pro (81.8%, p = 0.001). No significant difference was observed between ChatGPT-4o and ChatGPT-3.5.

Conclusion: LLMs demonstrated strong performance in generating primary and differential diagnoses from text-based radiologic findings, with ChatGPT-3.5 and ChatGPT-4o outperforming Gemini 1.5 Pro. However, ChatGPT-4V showed substantially lower accuracy than radiologists in direct radiological image interpretation. While promising in text-based applications, LLMs require further development and validation before they can be relied on for image interpretation.
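To illustrate the statistical comparisons described in the Materials and Methods, the following is a minimal sketch (not the authors' analysis code) of how per-case correct/incorrect indicators could be compared with McNemar's test and Cochran's Q test using Python's statsmodels. The correctness vectors below are hypothetical placeholders standing in for the per-case scoring described above.

```python
# A minimal sketch of the paired-accuracy comparisons described above,
# assuming one 0/1 correctness indicator per case and per reader.
# The vectors below are hypothetical placeholders, not study data.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar, cochrans_q

rng = np.random.default_rng(0)
n_cases = 121  # cases retrieved from the "Case of the Month" archive

radiologist1 = rng.integers(0, 2, n_cases)  # placeholder scores
gpt4v = rng.integers(0, 2, n_cases)
gpt35 = rng.integers(0, 2, n_cases)
gpt4o = rng.integers(0, 2, n_cases)
gemini = rng.integers(0, 2, n_cases)

# Pairwise comparison (e.g., Radiologist 1 vs ChatGPT-4V): McNemar's
# exact test on the 2x2 table of concordant/discordant case counts.
table = np.zeros((2, 2), dtype=int)
for a, b in zip(radiologist1, gpt4v):
    table[a, b] += 1
print(mcnemar(table, exact=True))

# Omnibus comparison across the three text-based models: Cochran's Q
# test on the n_cases x 3 matrix of binary outcomes.
print(cochrans_q(np.column_stack([gpt35, gpt4o, gemini])))
```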
