Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology
Abstract
Objective
Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. However, their evaluation is often limited to accuracy on medical multiple-choice question benchmarks and lacks assessment of consistency, robustness, and reasoning behavior.
Materials and Methods
We evaluate six small open-source medical LLMs (HuatuoGPT-o1-8B [1], Diabetica-7B [2], Diabetica-o1 [2], Meditron3-8B [3], MedFound-7B [4], and ClinicalGPT-base-zh [5]) in deterministic settings, where we examine how prompt variation and removal of option labels affect model output. In stochastic settings, we evaluate the variability of model responses and investigate the relationship between consistency and correctness. Lastly, we assess self-assessment bias by testing whether high-performing models can recognize the correct reasoning path when presented with gold-standard explanations. Responses were evaluated through a combination of human review and assessment by a pediatric endocrinology expert.
Results
HuatuoGPT-o1-8B achieved the highest score, answering 32 of the 91 cases correctly. All models exhibited high sensitivity to prompt phrasing (maximum agreement Cohen's κ = 0.55) and to label removal (highest Cohen's κ = 0.35). The results show that high consistency across model responses is not an indicator of correctness, although HuatuoGPT-o1-8B showed the highest consistency rate. When tasked with selecting the correct reasoning, both HuatuoGPT-o1-8B and Diabetica-o1 exhibited self-assessment bias and dependence on the order of the candidate explanations. Expert review of incorrect reasoning rationales identified a mix of clinically acceptable responses and clinical oversights.
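As context for the agreement figures above: Cohen's κ compares two sequences of answers (e.g., a model's responses under two prompt phrasings) and corrects the observed agreement for agreement expected by chance. The following pure-Python sketch illustrates the standard formula; the function name and sample labels are illustrative and not taken from the study.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected if the two sequences were independent.
    """
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement from the marginal label frequencies of each sequence.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two answer sequences agreeing on 3 of 4 items.
kappa = cohen_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"])  # 0.5
```

A κ of 0.55, as reported above for the best case of prompt-phrasing agreement, corresponds only to moderate agreement on conventional interpretation scales.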
Discussion
Although none of the evaluated models demonstrated deep expertise in pediatric endocrinology, HuatuoGPT-o1-8B showed the highest robustness to input variability and the greatest stability across the hyperparameter settings used for inference.
Conclusion
This work underscores the limitations of relying solely on accuracy for evaluating medical LLMs and proposes a broader diagnostic framework for understanding potential pitfalls in real-world clinical decision support.