Structured Taxonomy and Framework for Developing Medical Benchmark in Large Language Models Derived from Scoping Review
Abstract
With the rapid advancement of large language model (LLM) technology, numerous studies have explored its application in the medical field. Robust evaluation is crucial for ensuring reliability and safety, and it has led to the development of diverse benchmark datasets. In this study, we propose a structured taxonomy to provide researchers with practical guidance for benchmark selection. Furthermore, we introduce READY, a development framework built on five principles - Reliable, Ethical, Annotated, Diverse, and Yield-validated - to support the systematic design of medical benchmarks and strengthen future evaluation practices. To establish the taxonomy and framework, we systematically reviewed benchmark datasets designed for evaluating LLMs in medical contexts. A comprehensive literature search yielded 55 relevant studies. Each benchmark was analyzed using a structured framework encompassing dataset construction and evaluation methodology. We anticipate that this research will promote more rigorous and ethical LLM evaluation, paving the way for the safe application of LLMs in clinical settings.