Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology
Abstract
Objective
Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. However, their evaluation is often limited to accuracy on medical multiple-choice question benchmarks and lacks assessment of consistency, robustness, and reasoning behavior.
Materials and Methods
We evaluate six small open-source medical LLMs (HuatuoGPT-o1-8B [1], Diabetica-7B [2], Diabetica-o1 [2], Meditron3-8B [3], MedFound-7B [4], and ClinicalGPT-base-zh [5]) in deterministic settings, where we examine how prompt variation and removal of option labels affect model output. In stochastic settings, we evaluate the variability of model responses and investigate the relationship between consistency and correctness. Lastly, we assess self-assessment bias by testing whether high-performing models can recognize the correct reasoning path when presented with gold-standard explanations. Responses were evaluated through a combination of human review and assessment by a pediatric endocrinology expert.
Results
HuatuoGPT-o1-8B achieved the highest score, answering 32 of the 91 cases correctly. All models exhibited high sensitivity to prompt phrasing (maximum agreement Cohen's κ = 0.55) and to label removal (highest Cohen's κ = 0.35). The results show that high consistency across model responses is not an indicator of correctness, although HuatuoGPT-o1-8B showed the highest consistency rate. When tasked with selecting the correct reasoning, both HuatuoGPT-o1-8B and Diabetica-o1 exhibited self-assessment bias and dependence on the order of the candidate explanations. Expert review of incorrect reasoning rationales identified a mix of clinically acceptable responses and clinical oversights.
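As context for the agreement figures above: Cohen's κ compares two sequences of answers (e.g., a model's responses under two prompt phrasings) and corrects the observed agreement for agreement expected by chance. The following pure-Python sketch illustrates the standard formula; the function name and sample labels are illustrative and not taken from the study.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected if the two sequences were independent.
    """
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement from the marginal label frequencies of each sequence.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two answer sequences agreeing on 3 of 4 items.
kappa = cohen_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"])  # 0.5
```

A κ of 0.55, as reported above for the best case of prompt-phrasing agreement, corresponds only to moderate agreement on conventional interpretation scales.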
Discussion
Although none of the evaluated models demonstrated deep expertise in pediatric endocrinology, HuatuoGPT-o1-8B showed the highest robustness to input variability and the greatest stability across the hyperparameter settings used for inference.
Conclusion
This work underscores the limitations of relying solely on accuracy for evaluating medical LLMs and proposes a broader diagnostic framework for understanding potential pitfalls in real-world clinical decision support.