Benchmarking Large Language Models on USMLE: Evaluating ChatGPT, DeepSeek, Grok, and Qwen in Clinical Reasoning and Medical Licensing Scenarios

Abstract

Artificial intelligence (AI) is transforming healthcare by assisting with intricate clinical reasoning and diagnosis. Recent research demonstrates that large language models (LLMs) such as ChatGPT and DeepSeek hold considerable promise for medical comprehension. This study systematically evaluates the clinical reasoning capabilities of four advanced LLMs—ChatGPT, DeepSeek, Grok, and Qwen—using the United States Medical Licensing Examination (USMLE) as a standard benchmark. We assess 376 publicly available USMLE sample exam questions (Step 1, Step 2 CK, Step 3) from the most recent booklet, released in July 2023. We analyze model performance across four question categories (text-only, text with image, text with mathematical reasoning, and integrated text-image-mathematical reasoning) and measure accuracy at each of the three USMLE Steps. Our findings indicate that on Step 2 CK, DeepSeek consistently outperforms the other models, achieving a peak accuracy of 93%. Although ChatGPT lags only slightly behind, the limited convergence in error patterns across models suggests that ensemble approaches might enhance effectiveness. Grok and Qwen demonstrate lower and less consistent accuracy across all Steps. These findings highlight the potential of LLMs for clinical reasoning in medical licensing scenarios. However, we also emphasize that these models require further refinement to ensure their safe and effective integration into real-world healthcare workflows.