Assessing the Capability of Large Language Models in Answering Pediatric Critical Care Board-Style Questions

Abstract

Background: The potential of large language models (LLMs) in medicine is often linked to massive, resource-intensive models. However, their practical application in specialized fields such as pediatric critical care requires exploring the capability of more efficient, locally deployable open-source alternatives. In this study, we evaluated the accuracy and clinical reasoning of open-source LLMs of varying sizes, specifically assessing whether smaller, efficient models can perform comparably to larger ones on pediatric critical care multiple-choice questions (MCQs).

Methods: A set of 100 pediatric critical care MCQs spanning six clinical domains (calculation, diagnosis, ethics, management, pharmacology, and physiology) was curated by two pediatric specialists to evaluate eight open-source LLMs ranging from 2 to 70 billion parameters. The LLMs were assessed on overall and category-specific accuracy and on a clinical reasoning quality score rated on a 5-point Likert scale. Two pediatric critical care fellows also completed the MCQs for comparison. Cochran's Q test, McNemar's test, the Friedman test, and Cohen's kappa were used for the statistical analysis.

Results: While the largest model (Llama-3.3-70B) achieved the highest accuracy (78%; 95% CI, 69%-86%), a key finding was the performance of the much smaller, 14.7-billion-parameter Phi-4. This efficient model was strikingly comparable, with 75% accuracy (95% CI, 65%-83%) and a similar reasoning score (4.40 vs. 4.49 out of 5). Both models performed on par with the pediatric critical care fellows. The LLMs excelled in ethics but struggled with calculations. Inter-rater reliability for the clinical reasoning assessment was excellent (κ = 0.92).

Conclusions: Our findings demonstrate that smaller, efficient LLMs can approach the performance of much larger models and of pediatric critical care fellows on complex pediatric critical care reasoning. This suggests a viable pathway for developing secure, locally deployable decision-support tools without relying on massive, proprietary systems. These models also hold potential as complementary resources for trainee education in pediatric critical care. However, their identified weaknesses, especially in calculations, underscore that rigorous, domain-specific validation is an essential prerequisite for safe use in both clinical and educational contexts.

Trial registration: Not applicable.
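As a rough illustration of the kinds of comparisons described in the Methods (not the authors' actual analysis code), the Python sketch below shows how per-question correctness for two models could be compared with McNemar's test, how a 95% confidence interval for accuracy could be computed, and how Cohen's kappa could quantify agreement between two raters' reasoning scores. All data arrays are hypothetical placeholders, and the exact CI method and kappa variant used in the study are assumptions.

```python
# Illustrative sketch only: hypothetical data, not the study's analysis pipeline.
import numpy as np
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.contingency_tables import mcnemar
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical per-question correctness (1 = correct) for two models on 100 MCQs.
model_a = rng.binomial(1, 0.78, size=100)   # e.g., a large model
model_b = rng.binomial(1, 0.75, size=100)   # e.g., a smaller model

# Accuracy with a Wilson-score 95% confidence interval (the study's CI method may differ).
acc_a = model_a.mean()
ci_low, ci_high = proportion_confint(model_a.sum(), model_a.size, alpha=0.05, method="wilson")
print(f"Model A accuracy: {acc_a:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")

# McNemar's test on the paired 2x2 table of agreements/disagreements between the two models.
table = np.array([
    [np.sum((model_a == 1) & (model_b == 1)), np.sum((model_a == 1) & (model_b == 0))],
    [np.sum((model_a == 0) & (model_b == 1)), np.sum((model_a == 0) & (model_b == 0))],
])
print(mcnemar(table, exact=True))

# Cohen's kappa for agreement between two raters' 5-point reasoning scores
# (a weighted kappa may be preferable for ordinal Likert data).
rater_1 = rng.integers(3, 6, size=100)                              # hypothetical rater 1 scores
rater_2 = np.clip(rater_1 + rng.integers(-1, 2, size=100), 1, 5)    # correlated rater 2 scores
print("Cohen's kappa:", cohen_kappa_score(rater_1, rater_2))
```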
