Comparison of Multimodal Large Language Models and Physicians for Medical Diagnosis Using NEJM Image Challenge Cases: Cross-sectional Study


Abstract

Background
Multimodal large language models (LLMs), capable of processing both images and text, may enhance diagnostic accuracy in clinical practice, particularly for rare diseases where diagnostic expertise is limited.

Methods
We evaluated three multimodal LLMs (GPT-4o, Claude 3.7 Sonnet, and Doubao) on 272 cases from the New England Journal of Medicine (NEJM) Image Challenge (June 2009–March 2025), conducting 11 comprehensive analyses. Each model was tested with images alone and with combined image-text inputs, and its answers were compared against 16,401,888 responses from physicians worldwide (mean, 60,301 responses per case). Training-data contamination was assessed by comparing performance on cases published before and after each model's training cutoff. The primary outcome was diagnostic accuracy under multimodal (image-text) testing.

Results
Temporal analysis revealed no evidence of training-data contamination: the models maintained or improved their performance on post-cutoff cases. All LLMs significantly outperformed physicians in multimodal testing (exact p < 0.000001 after correction for multiple comparisons). Diagnostic accuracies were 89.0% (Wilson 95% confidence interval [CI], 84.9–92.3) for Claude 3.7 Sonnet, 88.6% (Wilson 95% CI, 84.5–92.0) for GPT-4o, and 71.0% (Wilson 95% CI, 65.3–76.2) for Doubao, compared with 46.7% (Wilson 95% CI, 40.7–52.7) for the physician majority vote; for Claude 3.7 Sonnet and GPT-4o the absolute difference exceeded 40 percentage points. In diagnostically challenging cases with less than 40% physician consensus, Claude 3.7 Sonnet maintained 86.5% accuracy versus 33.4% for physicians. Model-physician concordance was low (Cohen's κ, 0.08–0.24), with a 15.4:1 ratio of model-advantage to physician-advantage cases for Claude 3.7 Sonnet. Adding clinical text to the images improved accuracy by 28–42 percentage points across all models. At least one model was correct in 96.3% of cases.

Conclusions
Multimodal LLMs demonstrated superior diagnostic performance compared with physicians across diverse clinical scenarios, with evidence suggesting authentic reasoning capability rather than training-data memorization. These findings support the potential of multimodal AI as a diagnostic tool in clinical practice.
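For readers who want to sanity-check the statistics, the short Python sketch below implements the two methods referenced above: the Wilson score interval and an exact test on discordant pairs (a McNemar-style test, which is one plausible reading of the "exact p" reported for the paired model-physician comparison). The input counts are assumptions back-calculated from the abstract, not figures taken from the paper: 89.0% of 272 cases is roughly 242 correct for Claude 3.7 Sonnet, and a 15.4:1 advantage ratio is consistent with about 123 versus 8 discordant cases, so the output only approximately reproduces the published intervals.

```python
from math import comb, sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% CI at z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar test from discordant-pair counts:
    b = cases the model got right and physicians got wrong,
    c = cases physicians got right and the model got wrong."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Assumed counts, back-calculated from the abstract (not from the paper):
# ~242/272 correct for Claude 3.7 Sonnet; ~123 vs. 8 discordant cases.
lo, hi = wilson_ci(242, 272)
print(f"accuracy {242 / 272:.1%}, Wilson 95% CI {lo:.1%}-{hi:.1%}")
print(f"exact McNemar p ~ {mcnemar_exact_p(123, 8):.1e}")
```

Under these assumed counts the script prints an interval of roughly 84.7–92.2% (close to, but not exactly, the reported 84.9–92.3, as expected given the rounding in the back-calculation) and an exact p-value far below the 0.000001 threshold quoted in the abstract.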
