Beyond the Leaderboard: The Limitations of LLM Benchmarks and the Case for Real-World Clinical Evaluation
Abstract
This article critically examines the limitations of current large language model (LLM) benchmarks, particularly in healthcare and clinical evaluation. While standardised leaderboards and benchmarks have driven rapid technical progress and shaped industry perceptions, they are increasingly undermined by issues such as benchmark data contamination and narrow assessment criteria. Benchmark leakage can inflate reported performance, and traditional evaluation based on multiple-choice questions does not reflect the complexity of clinical practice. Specialised medical benchmarks, though more targeted, still overlook essential attributes such as reliability, calibration, and safety, and often lack representation of diverse healthcare contexts and languages. A shift toward real-world evaluation frameworks is required, emphasising scenario-based simulations, multisite validation, and comprehensive translational assessment. The Translational Evaluation of Healthcare AI (TEHAI) framework is presented as a robust alternative that integrates technical, utility, and adoption criteria and explicitly addresses ethical and contextual factors. Genuine clinical benefit and patient safety can be ensured only through continuous, context-specific evaluation that goes beyond traditional benchmarking.