Large Language Model Benchmarks in Medical Tasks

Lawrence K.Q. Yan
Ming Li
Yichao Zhang
Caitlyn Heqi Yin
Cheng Fei
Benji Peng
Ziqian Bi
Pohsun Feng
Keyu Chen
Junyu Liu
Qian Niu

Abstract

As large language models (LLMs) are increasingly applied in the medical domain, evaluating their performance on benchmark datasets has become crucial. This paper presents a comprehensive survey of benchmark datasets used in medical LLM tasks. The datasets span text, image, and multimodal benchmarks and cover different aspects of medical knowledge, including electronic health records (EHRs), doctor-patient dialogues, medical question answering, and medical image captioning. The survey categorizes the datasets by modality and discusses their significance, data structure, and impact on the development of LLMs for clinical tasks such as diagnosis, report generation, and predictive decision support. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert, which have facilitated advances in tasks such as medical report generation, clinical summarization, and synthetic data generation. The paper summarizes the challenges and opportunities in leveraging these benchmarks to advance multimodal medical intelligence, emphasizing the need for datasets with greater language diversity, structured omics data, and innovative approaches to synthesis. This work also provides a foundation for future research on the application of LLMs in medicine, contributing to the evolving field of medical artificial intelligence.

Version published to 10.31219/osf.io/8j7d3 on OSF Preprints
Oct 22, 2024

Related articles

MultiMed-ST Datasets for Machine Translation in Medical Applications

This article has 2 authors:
1. Giridhar Gowda
2. Suma R
This article has no evaluations
Latest version Jan 9, 2026

Prompt-Orchestrated Large Language Models for Clinical Information Extraction

This article has 13 authors:
1. Livia Lilli
2. Andrea Rosati
3. Giovanni Paolo Tobia
4. Massimo Criscione
5. Federica Tomassini
6. Chiara Dachena
7. Alice Luraschi
8. Chiara Cantarini
9. Carolina De Maria
10. Luigi Congedo
11. Massimo Bernaschi
12. Stefano Patarnello
13. Anna Fagotti
This article has no evaluations
Latest version Jan 16, 2026

A Retrieval Augmented System for Cardiological Electronic Health Records.

This article has 9 authors:
1. Annamaria Defilippo
2. Giovanni Canino
3. Nicola Procopio
4. Albino Trapuzzano
5. Sabato Sorrentino
6. Ciro Indolfi
7. Patrizia Vizza
8. Pierangelo Veltri
9. Pietro Hiram Guzzi
This article has no evaluations
Latest version Jan 9, 2026
