Assessing the Capability of Large Language Models in Answering Pediatric Critical Care Board-Style Questions

Abstract

Background: The potential of large language models (LLMs) in medicine is often linked to massive, resource-intensive models. However, their practical application in specialized fields such as pediatric critical care requires exploring the capability of more efficient, locally deployable open-source alternatives. In this study, we evaluated the accuracy and clinical reasoning of open-source LLMs of varying sizes, specifically assessing whether smaller, efficient models can perform comparably to larger ones on pediatric critical care multiple-choice questions (MCQs).

Methods: A set of 100 pediatric critical care MCQs spanning six clinical domains (calculation, diagnosis, ethics, management, pharmacology, and physiology) was curated by two pediatric specialists to evaluate eight open-source LLMs ranging from 2 to 70 billion parameters. The LLMs were assessed on overall and category-specific accuracy and on a clinical reasoning quality score rated on a 5-point Likert scale. Two pediatric critical care fellows also completed the MCQs for comparison. Cochran's Q test, McNemar's test, the Friedman test, and Cohen's kappa were used for the statistical analysis.

Results: While the largest model (Llama-3.3-70B) achieved the highest accuracy (78%; 95% CI, 69%-86%), a key finding was the performance of the much smaller, 14.7-billion-parameter Phi-4. This efficient model was strikingly comparable, with 75% accuracy (95% CI, 65%-83%) and a similar reasoning score (4.40 vs. 4.49 out of 5). Both models performed on par with the pediatric critical care fellows. The LLMs excelled in ethics but struggled with calculations. Inter-rater reliability for the clinical reasoning assessment was excellent (κ = 0.92).

Conclusions: Our findings demonstrate that smaller, efficient LLMs can approach the performance of much larger models and of pediatric critical care fellows on complex pediatric critical care reasoning. This suggests a viable pathway for developing secure, locally deployable decision-support tools without relying on massive, proprietary systems. These models also hold potential as complementary resources for trainee education in pediatric critical care. However, their identified weaknesses, especially in calculations, underscore that rigorous, domain-specific validation is an essential prerequisite for safe use in both clinical and educational contexts.

Trial registration: Not applicable.
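As a rough illustration of the kinds of comparisons described in the Methods (not the authors' actual analysis code), the Python sketch below shows how per-question correctness for two models could be compared with McNemar's test, how a 95% confidence interval for accuracy could be computed, and how Cohen's kappa could quantify agreement between two raters' reasoning scores. All data arrays are hypothetical placeholders, and the exact CI method and kappa variant used in the study are assumptions.

```python
# Illustrative sketch only: hypothetical data, not the study's analysis pipeline.
import numpy as np
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.contingency_tables import mcnemar
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical per-question correctness (1 = correct) for two models on 100 MCQs.
model_a = rng.binomial(1, 0.78, size=100)   # e.g., a large model
model_b = rng.binomial(1, 0.75, size=100)   # e.g., a smaller model

# Accuracy with a Wilson-score 95% confidence interval (the study's CI method may differ).
acc_a = model_a.mean()
ci_low, ci_high = proportion_confint(model_a.sum(), model_a.size, alpha=0.05, method="wilson")
print(f"Model A accuracy: {acc_a:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")

# McNemar's test on the paired 2x2 table of agreements/disagreements between the two models.
table = np.array([
    [np.sum((model_a == 1) & (model_b == 1)), np.sum((model_a == 1) & (model_b == 0))],
    [np.sum((model_a == 0) & (model_b == 1)), np.sum((model_a == 0) & (model_b == 0))],
])
print(mcnemar(table, exact=True))

# Cohen's kappa for agreement between two raters' 5-point reasoning scores
# (a weighted kappa may be preferable for ordinal Likert data).
rater_1 = rng.integers(3, 6, size=100)                              # hypothetical rater 1 scores
rater_2 = np.clip(rater_1 + rng.integers(-1, 2, size=100), 1, 5)    # correlated rater 2 scores
print("Cohen's kappa:", cohen_kappa_score(rater_1, rater_2))
```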
