An interdisciplinary, randomized, single-blind evaluation of state-of-the-art large language models for their implications and risks in medical diagnosis and management
Abstract
Background
State-of-the-art (SOTA) large language models (LLMs) are poised to revolutionize clinical medicine by transforming diagnostic, therapeutic, and interdisciplinary reasoning. Despite their promising capabilities, rigorous benchmarking of these models is essential to address concerns about their clinical proficiency and safety, particularly in high-risk environments.
Methods
This study implemented a multi-disciplinary, randomized, single-blind evaluation framework involving 27 experienced specialty clinicians with an average of 25.9 years of practice. The assessment covered 685 simulated and real clinical cases across 13 subspecialties, including both common and rare conditions. Evaluators rated LLM responses on medical strength (0–10 scale, where > 9.5 signified leading expert proficiency) and hallucination severity (0–5 scale for fabricated or misleading medical elements). Seven SOTA LLMs were tested, including top-ranked models from the ARENA leaderboard, with statistical analyses applied to adjust for confounders such as response length.
Findings
The evaluation revealed clinical plausibility in general-purpose LLMs, with Gemini 2.0 Flash leading raw scores and DeepSeek R1 excelling in adjusted analyses. Top models demonstrated proficiency comparable to that of a physician with 6 years of post-qualification experience (score ∼6.0), yet significant risks were noted: instances of incompetence (scores ≤4) were detected across specialties, along with 40 hallucination instances involving fabricated conditions, medications, and classification errors. These findings underscore the need for stringent safeguards to mitigate potential adverse outcomes in clinical applications.
Interpretation
While SOTA LLMs show substantial promise in enhancing clinical reasoning and decision-making, their unguarded application in medicine could pose serious risks, including misinformation and diagnostic errors. Human expert oversight remains crucial, particularly given the observed incompetence and hallucination risks. Larger, multi-center studies are warranted to evaluate real-world performance and to track model evolution before broader clinical adoption.