Multimodal Large Language Models Challenge NEJM Image Challenge
Abstract
Background
In principle, multimodal large language models (LLMs) reflect real-world clinical diagnostic scenarios better than text-only models, because they can integrate images with clinical text. The New England Journal of Medicine Image Challenge comprises real clinical cases with both images and textual material, making it an ideal resource for testing the diagnostic accuracy of multimodal LLMs.

Methods
We analyzed 272 Image Challenge cases (June 2009 to March 2025) that contained both images and clinical text. Three LLMs (GPT-4o, Claude 3.7, and Doubao) were evaluated against 16,401,888 physician responses worldwide (mean, 60,301 per case). Each model was tested with images alone and with combined image-text inputs. The primary outcome was diagnostic accuracy in the multimodal condition.

Results
All three LLMs significantly outperformed physicians (P < 0.001). Diagnostic accuracy with multimodal input was 89.0% (95% CI, 84.9 to 92.3) for Claude 3.7, 88.6% (95% CI, 84.5 to 92.0) for GPT-4o, and 71.0% (95% CI, 65.3 to 76.2) for Doubao, compared with 46.7% (95% CI, 40.7 to 52.7) for the physician majority vote, an absolute difference exceeding 40 percentage points for the top-performing models. In diagnostically challenging cases, defined as those in which fewer than 40% of physicians answered correctly, Claude 3.7 maintained 86.5% accuracy versus 33.4% for physicians. Despite this high accuracy, model-physician concordance was low (Cohen's κ, 0.08 to 0.24), with a 15.4:1 ratio of model-advantage to physician-advantage cases for Claude 3.7. Adding clinical text to images improved accuracy by 28 to 42 percentage points across models. At least one model was correct in 96.3% of cases.

Conclusions
Multimodal input yielded significantly higher diagnostic accuracy than image-only input, and the models substantially exceeded physician diagnostic performance. High model accuracy coupled with low physician-model concordance suggests that multimodal LLMs rely on diagnostic reasoning processes fundamentally different from those of physicians. These findings suggest that multimodal LLMs may serve as valuable diagnostic assistants, augmenting rather than replacing physician clinical decision-making.
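The abstract reports 95% confidence intervals for each accuracy estimate but does not state the interval method or the exact numerators. As an illustration only, the sketch below computes a Wilson score interval for Claude 3.7's multimodal accuracy, assuming 242 correct answers out of 272 cases (≈89.0%); both the Wilson method and the count of 242 are assumptions, not details taken from the study.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion (z=1.96 -> 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Assumed numerator: 89.0% of 272 cases is ~242 correct answers.
lo, hi = wilson_ci(242, 272)
print(f"accuracy = {242/272:.1%}, 95% CI = ({lo:.1%}, {hi:.1%})")
# -> accuracy = 89.0%, 95% CI = (84.7%, 92.2%)
```

Other interval methods (e.g., Clopper-Pearson) give slightly different bounds, which may account for the small gap from the reported 84.9 to 92.3.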
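Model-physician concordance is summarized as Cohen's κ of 0.08 to 0.24. The per-case data are not given in the abstract, but κ itself is straightforward to compute from paired labels; the sketch below is a minimal illustration on invented correctness labels, not the study's data.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                  # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] / n * cb[k] / n for k in set(ca) | set(cb))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-case correctness for a model (90% right) and the physician
# majority vote (50% right): both can score reasonably while agreement beyond
# chance stays low, which is the pattern the study reports.
model     = ["right"] * 9 + ["wrong"]
physician = ["right"] * 5 + ["wrong"] * 5
print(f"kappa = {cohens_kappa(model, physician):.2f}")  # -> kappa = 0.20
```

A low κ despite high accuracy on both sides means model and physician errors fall on largely different cases, which is consistent with the finding that at least one model was correct in 96.3% of cases.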