Diagnostic Performance of Large Language Models and Radiologists in Case-Based Radiology Questions


Abstract

Background: Large language models (LLMs) have demonstrated considerable potential in supporting medical decision-making. Until recently, LLMs were restricted to text inputs, limiting their utility in image interpretation. The introduction of ChatGPT-4V, which can analyze visual data, has opened new opportunities to evaluate LLM performance in radiological image interpretation. This study investigates the performance of ChatGPT-4V in radiological image interpretation compared to two board-certified radiologists. The secondary aim is to compare the accuracy of primary and differential diagnoses provided by three different LLMs.

Materials and Methods: A total of 121 radiology cases were retrospectively retrieved from the Association of Academic Radiology “Case of the Month” archive. Each case consisted of the following sections: patient presentation, unlabeled findings, labeled findings, final diagnosis, case discussion, and references. Three LLMs (ChatGPT-3.5, ChatGPT-4o, and Google Gemini 1.5 Pro) were provided with the patient presentation and labeled findings sections. ChatGPT-4V and two board-certified radiologists were provided with the patient presentation and unlabeled findings sections, including radiological images. All comparisons were conducted separately for image-based evaluations (Radiologist 1, Radiologist 2, and GPT-4V) and text-based evaluations (GPT-4o, GPT-3.5, and Gemini 1.5 Pro). Categorical variables indicating correct and incorrect responses were summarized as frequencies and percentages. Primary diagnosis and differential diagnosis accuracy were analyzed with McNemar’s test and Cochran’s Q test.

Results: Both Radiologist 1 (72.5%) and Radiologist 2 (71.1%) significantly outperformed ChatGPT-4V (38.8%) (p < 0.001). ChatGPT-3.5 achieved the highest primary diagnostic accuracy (80.9%), followed by ChatGPT-4o (78.5%) and Gemini 1.5 Pro (72.7%). For differential diagnoses, ChatGPT-4o achieved the highest accuracy (90.9%), slightly outperforming ChatGPT-3.5 (90.1%) and significantly exceeding Gemini 1.5 Pro (81.8%, p = 0.001). No significant difference was observed between ChatGPT-4o and ChatGPT-3.5.

Conclusion: LLMs demonstrated strong performance in generating primary and differential diagnoses from text-based radiologic findings, with ChatGPT-3.5 and ChatGPT-4o outperforming Gemini 1.5 Pro. However, ChatGPT-4V showed substantially lower accuracy than radiologists in direct radiological image interpretation. While promising in text-based applications, LLMs require further development and validation before they can be relied on for image interpretation.
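To illustrate the statistical comparisons described in the Materials and Methods, the following is a minimal sketch (not the authors' analysis code) of how per-case correct/incorrect indicators could be compared with McNemar's test and Cochran's Q test using Python's statsmodels. The correctness vectors below are hypothetical placeholders standing in for the per-case scoring described above.

```python
# A minimal sketch of the paired-accuracy comparisons described above,
# assuming one 0/1 correctness indicator per case and per reader.
# The vectors below are hypothetical placeholders, not study data.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar, cochrans_q

rng = np.random.default_rng(0)
n_cases = 121  # cases retrieved from the "Case of the Month" archive

radiologist1 = rng.integers(0, 2, n_cases)  # placeholder scores
gpt4v = rng.integers(0, 2, n_cases)
gpt35 = rng.integers(0, 2, n_cases)
gpt4o = rng.integers(0, 2, n_cases)
gemini = rng.integers(0, 2, n_cases)

# Pairwise comparison (e.g., Radiologist 1 vs ChatGPT-4V): McNemar's
# exact test on the 2x2 table of concordant/discordant case counts.
table = np.zeros((2, 2), dtype=int)
for a, b in zip(radiologist1, gpt4v):
    table[a, b] += 1
print(mcnemar(table, exact=True))

# Omnibus comparison across the three text-based models: Cochran's Q
# test on the n_cases x 3 matrix of binary outcomes.
print(cochrans_q(np.column_stack([gpt35, gpt4o, gemini])))
```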
