A statistical framework for evaluating the repeatability and reproducibility of large language models
Abstract
A major concern in applying large language models (LLMs) to medicine is their reliability. Because LLMs generate text by sampling the next token (or word) from a probability distribution, the stochastic nature of this process can lead to different outputs even when the input prompt, model architecture, and parameters remain the same. Variation in model output has important implications for reliability in medical applications, yet it remains underexplored and lacks standardized metrics. To address this gap, we propose a statistical framework that systematically quantifies LLM variability using two metrics: repeatability, the consistency of LLM responses across repeated runs under identical conditions, and reproducibility, the consistency across runs under different conditions. Within these metrics, we evaluate two complementary dimensions: semantic consistency, which measures the similarity in meaning across responses, and internal stability, which measures the stability of the model’s underlying token-generating process. We applied this framework to medical reasoning as a use case, evaluating LLM repeatability and reproducibility on standardized United States Medical Licensing Examination (USMLE) questions and real-world rare disease cases from the Undiagnosed Diseases Network (UDN) using validated medical reasoning prompts. LLM responses were less variable for UDN cases than for USMLE questions, suggesting that the complexity and ambiguity of real-world patient presentations may constrain the model’s output space and yield more stable reasoning. Repeatability and reproducibility did not correlate with diagnostic accuracy, underscoring that an LLM producing a correct answer is not equivalent to producing it consistently. By providing a systematic approach to quantifying LLM repeatability and reproducibility, our framework supports more reliable use of LLMs in medicine and biomedical research.
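The semantic-consistency dimension described above can be illustrated with a minimal sketch: score every pair of responses from repeated runs with a similarity function and average the pairwise scores. The abstract does not specify the similarity measure, so the token-overlap (Jaccard) similarity below is a hypothetical stand-in for an embedding-based semantic similarity; the function names and example responses are likewise illustrative, not taken from the paper.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two responses.
    A simple stand-in for an embedding-based semantic similarity (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def semantic_consistency(responses: list[str]) -> float:
    """Mean pairwise similarity across repeated runs.
    Under identical conditions this estimates repeatability; comparing runs
    under different conditions (e.g. temperature, model version) would
    estimate reproducibility instead."""
    pairs = list(combinations(responses, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical repeated runs on the same diagnostic prompt
runs = [
    "the likely diagnosis is fabry disease",
    "the likely diagnosis is fabry disease",
    "fabry disease is the most likely diagnosis",
]
print(round(semantic_consistency(runs), 3))
```

A score of 1.0 means every run produced semantically identical output; lower values indicate variability even when the prompt and parameters are fixed, which is exactly the quantity the framework's repeatability metric is meant to capture.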