CARDBiomedBench: A Benchmark for Evaluating Large Language Model Performance in Biomedical Research

Abstract

Background: Biomedical research requires sophisticated understanding and reasoning across multiple specializations. While large language models (LLMs) show promise in scientific applications, their capability to safely and accurately support complex biomedical research remains uncertain.

Methods: We present CARDBiomedBench, a novel question-and-answer benchmark for evaluating LLMs in biomedical research. For our pilot implementation, we focus on neurodegenerative diseases (NDDs), a domain requiring integration of genetic, molecular, and clinical knowledge. The benchmark combines expert-annotated question-answer (Q/A) pairs with semi-automated data augmentation, drawing from authoritative public resources including drug development data, genome-wide association studies (GWAS), and Summary-data-based Mendelian Randomization (SMR) analyses. We evaluated seven private and open-source LLMs across ten biological categories and nine reasoning skills, using novel metrics to assess both response quality and safety.

Results: Our benchmark comprises over 68,000 Q/A pairs, enabling robust evaluation of LLM performance. Current state-of-the-art models show significant limitations: models like Claude-3.5-Sonnet demonstrate excessive caution (Response Quality Rate: 25% [95% CI: 25% +/- 1], Safety Rate: 76% +/- 1), while others like ChatGPT-4o exhibit both poor accuracy and unsafe behavior (Response Quality Rate: 37% +/- 1, Safety Rate: 31% +/- 1). These findings reveal fundamental gaps in LLMs' ability to handle complex biomedical information.

Conclusion: CARDBiomedBench establishes a rigorous standard for assessing LLM capabilities in biomedical research. Our pilot evaluation in the NDD domain reveals critical limitations in current models' ability to safely and accurately process complex scientific information. Future iterations will expand to other biomedical domains, supporting the development of more reliable AI systems for accelerating scientific discovery.
