Evaluation of Closed and Open Large Language Models in Pediatric Cardiology Board Exam Performance


Abstract

Introduction

Large language models (LLMs) have gained traction in medicine, but there is limited research comparing closed- and open-source models in subspecialty contexts. This study evaluated ChatGPT-4.0o and DeepSeek-R1 on a pediatric cardiology board-style examination to quantify their accuracy and assess their clinical and educational utility.

Methods

ChatGPT-4.0o and DeepSeek-R1 were used to answer 88 text-based multiple-choice questions across 11 pediatric cardiology subtopics from a Pediatric Cardiology Board Review textbook. DeepSeek-R1's processing time per question was measured. Statistical analyses for model comparison were conducted using an unpaired two-tailed t-test, and bivariate correlations were assessed using Pearson's r.
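As a rough illustration of the statistical workflow described above, the sketch below runs an unpaired two-tailed t-test and a Pearson correlation in Python with scipy; the per-chapter accuracy and processing-time values are hypothetical placeholders, not data from this study.

```python
# Illustrative sketch only (not the authors' code): compare two models'
# per-chapter accuracy with an unpaired two-tailed t-test, and relate
# processing time to accuracy with Pearson's r.
# All numeric values below are hypothetical placeholders.
from scipy.stats import ttest_ind, pearsonr

# Hypothetical proportion correct per subtopic chapter (11 chapters)
gpt_chapter_acc = [0.75, 0.62, 0.88, 0.50, 0.75, 0.62, 0.75, 0.88, 0.62, 0.50, 0.75]
ds_chapter_acc  = [0.75, 0.50, 0.88, 0.62, 0.75, 0.62, 0.62, 0.88, 0.75, 0.50, 0.62]

# Unpaired t-test; scipy's default is two-tailed
t_stat, p_value = ttest_ind(gpt_chapter_acc, ds_chapter_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")

# Hypothetical mean DeepSeek-R1 processing time per chapter (seconds)
ds_time = [42, 55, 30, 70, 48, 51, 60, 33, 45, 68, 58]
r, p_corr = pearsonr(ds_time, ds_chapter_acc)
print(f"Pearson r = {r:.2f}, p = {p_corr:.2f}")
```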

Results

ChatGPT-4.0o and DeepSeek-R1 achieved 70% (62/88) and 68% (60/88) accuracy, respectively (p = 0.79). Subtopic accuracy was equal in 5 of 11 chapters, with each model outperforming its counterpart in 3 of 11. DeepSeek-R1's processing time negatively correlated with accuracy (r = -0.68, p = 0.02).

Conclusion

ChatGPT-4.0o and DeepSeek-R1 approached the passing threshold on a pediatric cardiology board-style examination with comparable accuracy, suggesting that open-source models may enhance clinical and educational outcomes while supporting sustainable AI development.
