CLEVER: Clinical Large Language Model Evaluation by Expert Review



Abstract

The proliferation of both general-purpose and healthcare-specific Large Language Models (LLMs) has intensified the challenge of rigorous benchmarking. Existing evaluation methods face critical limitations: data contamination undermines the validity of public benchmarks, self-preference biases LLM-as-a-judge approaches, and current tasks do not fully reflect real-world clinical applications. To address these issues, we introduce CLEVER, a blind, randomized, preference-based evaluation methodology conducted by practicing medical doctors on task-specific assessments. We apply CLEVER to compare GPT-4o with two healthcare-specific LLMs (8B and 70B parameters) across three tasks: medical summarization, clinical information extraction, and biomedical question answering. Results reveal that domain-specific small LLMs outperform GPT-4o by 45% to 92% in factuality, clinical relevance, and conciseness, while maintaining comparable performance in open-ended medical Q&A. These findings challenge the assumption that larger general-purpose models inherently excel in specialized domains. To ensure the reliability and reproducibility of the evaluation, we curated a unique dataset of 500 test cases, developed from scratch by four medical experts; of these, we share 100 test cases reserved specifically for reproducibility analysis, ensuring consistent evaluation across multiple runs. We validate the CLEVER methodology through inter-annotator agreement analysis, intraclass correlation assessment, and washout-period evaluation. This study highlights the importance of specialized LLMs in healthcare applications, emphasizing that domain-specific datasets and evaluation frameworks are crucial for obtaining actionable, real-world insights. CLEVER offers a reproducible, robust approach for evaluating LLMs in clinical contexts, contributing to better benchmarks and methodologies for future AI advancements in healthcare.
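As a rough illustration of the reliability statistics the abstract names, the sketch below computes Fleiss' kappa for inter-annotator agreement and a two-way random-effects intraclass correlation, ICC(2,1), over expert ratings. This is a minimal sketch, not the authors' implementation: the toy rating data, function names, and choice of ICC variant are all assumptions made here for illustration.

```python
# Minimal sketch (not the paper's code) of two reliability checks:
# Fleiss' kappa (categorical agreement) and ICC(2,1) (rating consistency).
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """ratings: (n_items, n_categories) counts of raters picking each category."""
    n = ratings.sum(axis=1)[0]                 # raters per item (assumed constant)
    p_j = ratings.sum(axis=0) / ratings.sum()  # overall category proportions
    P_i = ((ratings ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

def icc_2_1(scores: np.ndarray) -> float:
    """scores: (n_items, n_raters) numeric ratings; two-way random, single rater."""
    n, k = scores.shape
    grand = scores.mean()
    ms_r = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # items
    ms_c = n * ((scores.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    ss_e = ((scores - scores.mean(axis=1, keepdims=True)
                    - scores.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_e = ss_e / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Toy example: 4 raters, 3 preference categories, 5 test cases (illustrative only).
counts = np.array([[3, 1, 0], [2, 2, 0], [0, 4, 0], [1, 1, 2], [0, 0, 4]])
scores = np.array([[4, 5, 4, 4], [3, 3, 4, 3], [5, 5, 5, 4], [2, 3, 2, 2], [4, 4, 5, 4]])
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")
print(f"ICC(2,1)      = {icc_2_1(scores):.3f}")
```

The ICC(2,1) form follows the standard Shrout-Fleiss two-way random-effects, single-rater definition; whether the study used this variant or another (e.g., average-measure ICC) is not stated in the abstract.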
