Closing the Pediatric Divide: A Performance Analysis of the GPT-5 Family in Medical Diagnostics
Abstract
Background
Large Language Models (LLMs) have demonstrated significant potential in clinical medicine, but a persistent performance gap exists in the pediatric domain due to its unique complexities. This study provides the first comparative evaluation of the new GPT-5 family (Nano, Mini, and the full model) to assess the impact of model scale on diagnostic accuracy and on this adult-pediatric disparity.
Methods
A benchmarking study was conducted using 2,000 multiple-choice questions from the MedQA dataset, equally divided between the adult (n=1,000) and pediatric (n=1,000) domains. GPT-5, GPT-5 Mini, and GPT-5 Nano were tested via API with standardized parameters (temperature=0, reasoning effort=minimal, verbosity=low, max_tokens=170). Accuracy was calculated and statistically compared across domains for each model.
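The standardized query setup described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the prompt template, the helper name `build_request`, and the sample question are assumptions, while the fixed parameter values are taken from the Methods text.

```python
def build_request(model: str, question: str, options: dict[str, str]) -> dict:
    """Assemble one MedQA multiple-choice query with the study's fixed parameters."""
    prompt = question + "\n" + "\n".join(
        f"{k}. {v}" for k, v in sorted(options.items())
    )
    return {
        "model": model,                      # "gpt-5", "gpt-5-mini", or "gpt-5-nano"
        "input": prompt,
        "temperature": 0,                    # deterministic decoding
        "max_output_tokens": 170,            # token cap from the Methods section
        "reasoning": {"effort": "minimal"},  # reasoning effort = minimal
        "text": {"verbosity": "low"},        # verbosity = low
    }

# Toy example question (not from MedQA) to show the request shape
req = build_request(
    "gpt-5-nano",
    "A 2-year-old presents with a barking cough and inspiratory stridor. "
    "What is the most likely diagnosis?",
    {"A": "Croup", "B": "Epiglottitis", "C": "Asthma", "D": "Foreign body aspiration"},
)
# The request payload would then be submitted via the vendor SDK,
# e.g. client.responses.create(**req), and the chosen letter scored.
```

Holding decoding parameters constant across all three models is what makes the accuracy differences attributable to model scale rather than sampling noise.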
Results
A clear dose-response relationship was observed between model size and accuracy. GPT-5 Nano exhibited a significant performance gap, with an accuracy of 71.0% in adult medicine versus 55.4% in pediatrics (a 15.6 percentage point difference, p<0.001). GPT-5 Mini substantially narrowed this gap to 5.7 points (81.5% vs. 75.8%, p=0.001). Critically, the full GPT-5 model eliminated the disparity, scoring 86.3% in adult medicine and a statistically indistinguishable 88.5% in pediatrics (p=0.138). Performance gains from scaling up were disproportionately larger in the pediatric domain.
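The significance of the Nano model's 15.6-point gap can be checked from the reported counts alone. A minimal sketch follows, assuming a standard pooled two-proportion z-test (the abstract does not name its statistical method) and item counts of 710/1,000 and 554/1,000 inferred from the reported percentages:

```python
import math

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Two-tailed two-proportion z-test with a pooled variance estimate."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))
    return z, p_value

# GPT-5 Nano: 71.0% of 1,000 adult items vs. 55.4% of 1,000 pediatric items
z, p = two_proportion_z(710, 1000, 554, 1000)
print(f"z = {z:.2f}, p = {p:.1e}")  # p far below 0.001, consistent with the abstract
```

With n=1,000 per domain, a 15.6-point gap yields z ≈ 7.2, well beyond the p<0.001 threshold reported; the same calculation on the full model's 86.3% vs. 88.5% split gives a non-significant result, consistent with p=0.138.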
Conclusion
The GPT-5 family marks a substantial advancement in medical AI. The full-size model not only achieves high diagnostic accuracy but, crucially, overcomes the previously documented performance limitations in pediatrics. This demonstrates that sufficient model scale is vital for mastering the nuances of specialized clinical domains. These findings support a tiered implementation strategy based on task criticality and underscore the need for continued validation in real-world clinical settings.