Comparison of Multimodal Large Language Models and Physicians for Medical Diagnosis Using NEJM Image Challenge Cases: Cross-sectional Study


Abstract

Background
Multimodal large language models (LLMs), capable of processing both images and text, may enhance diagnostic accuracy in clinical practice, particularly for rare diseases where diagnostic expertise is limited.

Methods
We evaluated three multimodal LLMs (GPT-4o, Claude 3.7 Sonnet, and Doubao) on 272 cases from the New England Journal of Medicine (NEJM) Image Challenge (June 2009–March 2025), conducting 11 comprehensive analyses. Each model was tested with images alone and with combined image-text inputs, and its answers were compared against 16,401,888 responses from physicians worldwide (mean, 60,301 responses per case). Training-data contamination was assessed by comparing performance on cases published before and after each model's training cutoff. The primary outcome was diagnostic accuracy under multimodal (image-text) testing.

Results
Temporal analysis revealed no evidence of training-data contamination: the models maintained or improved their performance on post-cutoff cases. All LLMs significantly outperformed physicians in multimodal testing (exact p < 0.000001 after correction for multiple comparisons). Diagnostic accuracies were 89.0% (Wilson 95% confidence interval [CI], 84.9–92.3) for Claude 3.7 Sonnet, 88.6% (Wilson 95% CI, 84.5–92.0) for GPT-4o, and 71.0% (Wilson 95% CI, 65.3–76.2) for Doubao, compared with 46.7% (Wilson 95% CI, 40.7–52.7) for the physician majority vote; for Claude 3.7 Sonnet and GPT-4o the absolute difference exceeded 40 percentage points. In diagnostically challenging cases with less than 40% physician consensus, Claude 3.7 Sonnet maintained 86.5% accuracy versus 33.4% for physicians. Model-physician concordance was low (Cohen's κ, 0.08–0.24), with a 15.4:1 ratio of model-advantage to physician-advantage cases for Claude 3.7 Sonnet. Adding clinical text to the images improved accuracy by 28–42 percentage points across all models. At least one model was correct in 96.3% of cases.

Conclusions
Multimodal LLMs demonstrated superior diagnostic performance compared with physicians across diverse clinical scenarios, with evidence suggesting authentic reasoning capability rather than training-data memorization. These findings support the potential of multimodal AI as a diagnostic tool in clinical practice.
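For readers who want to sanity-check the statistics, the short Python sketch below implements the two methods referenced above: the Wilson score interval and an exact test on discordant pairs (a McNemar-style test, which is one plausible reading of the "exact p" reported for the paired model-physician comparison). The input counts are assumptions back-calculated from the abstract, not figures taken from the paper: 89.0% of 272 cases is roughly 242 correct for Claude 3.7 Sonnet, and a 15.4:1 advantage ratio is consistent with about 123 versus 8 discordant cases, so the output only approximately reproduces the published intervals.

```python
from math import comb, sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% CI at z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar test from discordant-pair counts:
    b = cases the model got right and physicians got wrong,
    c = cases physicians got right and the model got wrong."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Assumed counts, back-calculated from the abstract (not from the paper):
# ~242/272 correct for Claude 3.7 Sonnet; ~123 vs. 8 discordant cases.
lo, hi = wilson_ci(242, 272)
print(f"accuracy {242 / 272:.1%}, Wilson 95% CI {lo:.1%}-{hi:.1%}")
print(f"exact McNemar p ~ {mcnemar_exact_p(123, 8):.1e}")
```

Under these assumed counts the script prints an interval of roughly 84.7–92.2% (close to, but not exactly, the reported 84.9–92.3, as expected given the rounding in the back-calculation) and an exact p-value far below the 0.000001 threshold quoted in the abstract.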
