CLEVER: Clinical Large Language Model Evaluation by Expert Review



Abstract

The proliferation of both general-purpose and healthcare-specific Large Language Models (LLMs) has intensified the challenge of rigorous benchmarking. Existing evaluation methods face critical limitations: data contamination undermines the validity of public benchmarks, self-preference biases LLM-as-a-judge approaches, and current tasks do not fully reflect real-world clinical applications. To address these issues, we introduce CLEVER, a blind, randomized, preference-based evaluation methodology conducted by practicing medical doctors on task-specific assessments. We apply CLEVER to compare GPT-4o with two healthcare-specific LLMs (8B and 70B parameters) across three tasks: medical summarization, clinical information extraction, and biomedical question answering. Results reveal that domain-specific small LLMs outperform GPT-4o by 45% to 92% in factuality, clinical relevance, and conciseness, while maintaining comparable performance in open-ended medical Q&A. These findings challenge the assumption that larger general-purpose models inherently excel in specialized domains. To ensure the reliability and reproducibility of the evaluation, we curated a unique dataset of 500 test cases, developed from scratch by four medical experts; of these, we share 100 test cases reserved specifically for reproducibility analysis, ensuring consistent evaluation across multiple runs. We validate the CLEVER methodology through inter-annotator agreement analysis, intraclass correlation assessment, and washout-period evaluation. This study highlights the importance of specialized LLMs in healthcare applications, emphasizing that domain-specific datasets and evaluation frameworks are crucial for obtaining actionable, real-world insights. CLEVER offers a reproducible, robust approach for evaluating LLMs in clinical contexts, contributing to better benchmarks and methodologies for future AI advancements in healthcare.
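As a rough illustration of the reliability statistics the abstract names, the sketch below computes Fleiss' kappa for inter-annotator agreement and a two-way random-effects intraclass correlation, ICC(2,1), over expert ratings. This is a minimal sketch, not the authors' implementation: the toy rating data, function names, and choice of ICC variant are all assumptions made here for illustration.

```python
# Minimal sketch (not the paper's code) of two reliability checks:
# Fleiss' kappa (categorical agreement) and ICC(2,1) (rating consistency).
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """ratings: (n_items, n_categories) counts of raters picking each category."""
    n = ratings.sum(axis=1)[0]                 # raters per item (assumed constant)
    p_j = ratings.sum(axis=0) / ratings.sum()  # overall category proportions
    P_i = ((ratings ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

def icc_2_1(scores: np.ndarray) -> float:
    """scores: (n_items, n_raters) numeric ratings; two-way random, single rater."""
    n, k = scores.shape
    grand = scores.mean()
    ms_r = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # items
    ms_c = n * ((scores.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    ss_e = ((scores - scores.mean(axis=1, keepdims=True)
                    - scores.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_e = ss_e / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Toy example: 4 raters, 3 preference categories, 5 test cases (illustrative only).
counts = np.array([[3, 1, 0], [2, 2, 0], [0, 4, 0], [1, 1, 2], [0, 0, 4]])
scores = np.array([[4, 5, 4, 4], [3, 3, 4, 3], [5, 5, 5, 4], [2, 3, 2, 2], [4, 4, 5, 4]])
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")
print(f"ICC(2,1)      = {icc_2_1(scores):.3f}")
```

The ICC(2,1) form follows the standard Shrout-Fleiss two-way random-effects, single-rater definition; whether the study used this variant or another (e.g., average-measure ICC) is not stated in the abstract.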
