Benchmarking Large Language Models on USMLE: Evaluating ChatGPT, DeepSeek, Grok, and Qwen in Clinical Reasoning and Medical Licensing Scenarios

Abstract

Artificial intelligence (AI) is transforming healthcare by assisting with intricate clinical reasoning and diagnosis. Recent research demonstrates that large language models (LLMs) such as ChatGPT and DeepSeek hold considerable promise for medical comprehension. This study systematically evaluates the clinical reasoning capabilities of four advanced LLMs—ChatGPT, DeepSeek, Grok, and Qwen—using the United States Medical Licensing Examination (USMLE) as a standard benchmark. We assess 376 publicly available USMLE sample exam questions (Step 1, Step 2 CK, Step 3) from the most recent booklet, released in July 2023. We analyze model performance across four question categories (text-only, text with image, text with mathematical reasoning, and integrated text-image-mathematical reasoning) and measure accuracy at each of the three USMLE Steps. Our findings indicate that on Step 2 CK, DeepSeek consistently outperforms the other models, achieving a peak accuracy of 93%. Although ChatGPT lags only slightly behind, the limited convergence in error patterns across models suggests that ensemble approaches might enhance effectiveness. Grok and Qwen demonstrate lower and less consistent accuracy across all Steps. These findings highlight the potential of LLMs for clinical reasoning in medical licensing scenarios. However, we also emphasize that these models require further refinement to ensure their safe and effective integration into real-world healthcare workflows.