Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology


Abstract

Objective

Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. However, their evaluation is often limited to accuracy on medical multiple-choice question benchmarks and lacks assessment of consistency, robustness, and reasoning behavior.

Materials and Methods

We evaluate six small open-source medical LLMs (HuatuoGPT-o1 1 , Diabetica-7B 2 , Diabetica-o1 2 , Meditron3-8B 3 , MedFound-7B 4 , and ClinicaGPT-base-zh 5 ) in deterministic settings, where we examine how prompt variation and removal of option labels affect model output. In stochastic settings, we evaluate the variability of model responses and investigate the relationship between consistency and correctness. Lastly, we evaluate self-assessment bias by testing whether high-performing models can recognize the correct reasoning path when presented with gold-standard explanations. Response evaluation combined human review with assessment by a pediatric endocrinology expert.

Results

HuatuoGPT-o1-8B achieved the highest score, with 32 correct responses out of the 91 cases considered. All models exhibited high sensitivity to prompt phrasing (maximum level of agreement Cohen’s κ = 0.55) and to label removal (highest Cohen’s κ = 0.35). The results show that high consistency across model responses is not an indicator of correctness, although HuatuoGPT-o1-8B showed the highest consistency rate. When tasked with selecting correct reasoning, both HuatuoGPT-o1-8B and Diabetica-o1 exhibited self-assessment bias and dependency on the order of the candidate explanations. Expert review of incorrect reasoning rationales identified a mix of clinically acceptable responses and clinical oversights.
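The agreement statistic reported above can be reproduced in a few lines. The sketch below (not from the paper; the answer sequences are hypothetical) computes Cohen's κ between a model's answer choices under two prompt phrasings, i.e. observed agreement corrected for the agreement expected by chance given each run's label distribution:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of positions where both runs chose the same option.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each run's marginal label counts.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical multiple-choice answers from one model under two prompt variants:
run1 = ["A", "B", "C", "A", "D", "B", "A", "C"]
run2 = ["A", "B", "D", "A", "D", "C", "A", "C"]
print(round(cohens_kappa(run1, run2), 2))  # → 0.66
```

A κ of 1.0 means identical answers across the two phrasings, 0 means chance-level agreement; the κ = 0.55 reported here indicates that rephrasing the prompt frequently changed the selected option.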

Discussion

Although all of the evaluated models fell short of demonstrating deep expertise in pediatric endocrinology, HuatuoGPT-o1-8B demonstrated the highest robustness to input variability and the greatest stability across the hyperparameter settings used for inference.

Conclusion

This work underscores the limitations of relying solely on accuracy for evaluating medical LLMs and proposes a broader diagnostic framework for understanding potential pitfalls in real-world clinical decision support.
