Evaluation of Closed and Open Large Language Models in Pediatric Cardiology Board Exam Performance


Abstract

Introduction

Large language models (LLMs) have gained traction in medicine, but there is limited research comparing closed- and open-source models in subspecialty contexts. This study evaluated ChatGPT-4.0o and DeepSeek-R1 on a pediatric cardiology board-style examination to quantify their accuracy and assess their clinical and educational utility.

Methods

ChatGPT-4.0o and DeepSeek-R1 were used to answer 88 text-based multiple-choice questions across 11 pediatric cardiology subtopics from a Pediatric Cardiology Board Review textbook. DeepSeek-R1's processing time per question was measured. Statistical analyses for model comparison were conducted using an unpaired two-tailed t-test, and bivariate correlations were assessed using Pearson's r.
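As a rough illustration of the statistical workflow described above, the sketch below runs an unpaired two-tailed t-test and a Pearson correlation in Python with scipy; the per-chapter accuracy and processing-time values are hypothetical placeholders, not data from this study.

```python
# Illustrative sketch only (not the authors' code): compare two models'
# per-chapter accuracy with an unpaired two-tailed t-test, and relate
# processing time to accuracy with Pearson's r.
# All numeric values below are hypothetical placeholders.
from scipy.stats import ttest_ind, pearsonr

# Hypothetical proportion correct per subtopic chapter (11 chapters)
gpt_chapter_acc = [0.75, 0.62, 0.88, 0.50, 0.75, 0.62, 0.75, 0.88, 0.62, 0.50, 0.75]
ds_chapter_acc  = [0.75, 0.50, 0.88, 0.62, 0.75, 0.62, 0.62, 0.88, 0.75, 0.50, 0.62]

# Unpaired t-test; scipy's default is two-tailed
t_stat, p_value = ttest_ind(gpt_chapter_acc, ds_chapter_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")

# Hypothetical mean DeepSeek-R1 processing time per chapter (seconds)
ds_time = [42, 55, 30, 70, 48, 51, 60, 33, 45, 68, 58]
r, p_corr = pearsonr(ds_time, ds_chapter_acc)
print(f"Pearson r = {r:.2f}, p = {p_corr:.2f}")
```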

Results

ChatGPT-4.0o and DeepSeek-R1 achieved 70% (62/88) and 68% (60/88) accuracy, respectively (p = 0.79). Subtopic accuracy was equal in 5 of 11 chapters, with each model outperforming its counterpart in 3 of 11. DeepSeek-R1's processing time negatively correlated with accuracy (r = -0.68, p = 0.02).

Conclusion

ChatGPT-4.0o and DeepSeek-R1 approached the passing threshold on a pediatric cardiology board-style examination with comparable accuracy, suggesting that open-source models may enhance clinical and educational outcomes while supporting sustainable AI development.
