ARE LLMS READY FOR PEDIATRICS? A COMPARATIVE EVALUATION OF MODEL ACCURACY ACROSS CLINICAL DOMAINS

Abstract

Large Language Models (LLMs) are rapidly emerging as promising tools in healthcare, yet their effectiveness in pediatric contexts remains underexplored. This study evaluated the performance of eight contemporary LLMs, released between 2024 and 2025, in answering multiple-choice questions from the MedQA dataset, stratified into two distinct categories: adult medicine (1461 questions) and pediatrics (1653 questions). Models were tested using a standardized prompting methodology with default hyperparameters, simulating real-world use by non-expert clinical users. Accuracy scores for the adult and pediatric subsets were statistically compared using the chi-square test with Yates' correction. Five models (Amazon Nova Pro 1.0, GPT 3.5-turbo-0125, Gemini 2.0 Flash, Grok 2, and Claude 3 Sonnet) demonstrated significantly lower performance on pediatric questions, with accuracy drops exceeding 10 percentage points in the worst case. In contrast, ChatGPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet showed comparable performance across both domains, with ChatGPT-4o achieving the most balanced result (accuracy: 83.57% adult, 83.18% pediatric; p = 0.80). These findings suggest that while some models struggle with pediatric-specific content, more recent and advanced LLMs may offer improved generalizability and domain robustness. The observed variability highlights the critical importance of domain-specific validation prior to clinical implementation, particularly in specialized fields such as pediatrics.
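The statistical comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the correct-answer counts below are reconstructed approximately from the reported percentages for ChatGPT-4o (83.57% of 1461 adult questions, 83.18% of 1653 pediatric questions), and the test is a 2x2 chi-square with Yates' continuity correction via SciPy.

```python
# Hedged sketch: chi-square test with Yates' correction comparing
# adult vs. pediatric accuracy for one model (ChatGPT-4o figures).
# Counts are reconstructed from the reported percentages and are
# therefore approximate, not the study's raw data.
from scipy.stats import chi2_contingency

adult_total, pediatric_total = 1461, 1653
adult_correct = round(0.8357 * adult_total)          # ~1221
pediatric_correct = round(0.8318 * pediatric_total)  # ~1375

# 2x2 contingency table: rows = domain, columns = [correct, incorrect]
table = [
    [adult_correct, adult_total - adult_correct],
    [pediatric_correct, pediatric_total - pediatric_correct],
]

# correction=True applies Yates' continuity correction (df = 1)
chi2, p, dof, expected = chi2_contingency(table, correction=True)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.2f}")
```

A non-significant p-value here (the paper reports p = 0.80) indicates no detectable accuracy gap between the two domains for this model; the same procedure, applied per model, yields the significant drops reported for the other five.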