Evaluation of Large Language Models in Medical Examinations: A Scoping Review Protocol

Weiqi Wang
Baifeng Wang
Yan Zhu
Zhe Wang
Suyuan Peng


Abstract

Introduction

Large language models (LLMs) demonstrate human-level performance in three key domains: linguistic understanding, knowledge-based reasoning, and complex problem-solving. These characteristics make LLMs valuable tools for medical education. Standardized medical examinations evaluate the clinical competencies of trainees, and they also allow rigorous verification of LLMs’ accuracy and reliability in medical contexts. Current methods use standardized examinations to test LLMs’ clinical reasoning abilities, and significant performance variations emerge across different clinical scenarios. No comprehensive review has compared different LLM versions in medical examinations: most studies focus on individual models and lack comparative analyses across versions. Current approaches struggle to keep pace with evolving research needs. This study synthesizes the existing research on LLMs in medical examinations and, by analyzing current challenges and limitations, offers guidance for future investigations.

Methods and analysis

The protocol was designed following the JBI Manual for Evidence Synthesis guidelines. We established explicit inclusion/exclusion criteria and search strategies. Systematic searches were performed in PubMed and Web of Science Core Collection databases. The methodology details literature screening, data extraction, analysis frameworks, and process mapping. This approach ensures methodological rigor throughout the research process.

Ethics and dissemination

This protocol outlines a scoping review methodology. The study involves systematic synthesis and analysis of published literature. It does not include human/animal experimentation or sensitive data collection. Ethical approval is not required for this literature-based study.

Strengths and limitations of this study

This scoping review protocol adheres to established guidelines for conducting scoping reviews, including the JBI Manual for Evidence Synthesis and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR).

The search strategy covers two databases: PubMed and Web of Science Core Collection.

This scoping review will address the knowledge gap on LLMs in medical examinations that has arisen from recent rapid technological advances.

As is inherent to scoping reviews, the identified sources of evidence will not be critically appraised.

The results of the scoping review will serve as a basis for identifying directions for further research on LLMs in the field of medical examinations.

Version published to 10.1101/2025.06.11.25329442 on medRxiv
Jun 12, 2025

