Comparative Evaluation of Advanced AI Reasoning Models in Pediatric Clinical Decision Support: ChatGPT O1 vs. DeepSeek-R1
Abstract
Introduction
The adoption of advanced reasoning models, such as ChatGPT O1 and DeepSeek-R1, represents a pivotal step forward in clinical decision support, particularly in pediatrics. ChatGPT O1 employs “chain-of-thought reasoning” (CoT) to enhance structured problem-solving, while DeepSeek-R1 introduces self-reflection capabilities through reinforcement learning. This study aimed to evaluate the diagnostic accuracy and clinical utility of these models in pediatric scenarios using the MedQA dataset.
Materials and Methods
A total of 500 multiple-choice pediatric questions from the MedQA dataset were presented to ChatGPT O1 and DeepSeek-R1. Each question included four or more options, with one correct answer. The models were evaluated under uniform conditions. Performance was measured by accuracy; Cohen’s Kappa was used to assess agreement between the models, and chi-square tests were used to assess statistical significance. Responses were analyzed to determine the models’ effectiveness in addressing clinical questions.
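The three metrics named above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the authors' analysis code. The abstract does not say which table the chi-square test was applied to or which labels Cohen’s Kappa was computed on, so the sketch assumes per-question correctness flags for kappa and a 2×2 table of model by correct/incorrect for the chi-square test. The function names and the ten-question toy data are hypothetical.

```python
import math

def accuracy(correct_flags):
    """Fraction of questions answered correctly (flags are 1 or 0)."""
    return sum(correct_flags) / len(correct_flags)

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: share of items on which the two raters coincide.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    # Note: undefined (division by zero) when p_e == 1.
    return (p_o - p_e) / (1 - p_e)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square without continuity correction for [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical toy data: 1 = correct, 0 = incorrect, one entry per question.
model_a = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
model_b = [1, 0, 1, 1, 1, 1, 0, 1, 0, 1]

acc_a, acc_b = accuracy(model_a), accuracy(model_b)
kappa = cohens_kappa(model_a, model_b)
stat, p_value = chi_square_2x2(
    sum(model_a), len(model_a) - sum(model_a),
    sum(model_b), len(model_b) - sum(model_b),
)
print(f"accuracy A={acc_a:.3f}, B={acc_b:.3f}, kappa={kappa:.3f}, "
      f"chi2={stat:.3f}, p={p_value:.3f}")
```

Because both models answer the same questions, a paired test such as McNemar's is a common alternative to the unpaired 2×2 chi-square shown here. The sketch uses chi-square only because that is the test the abstract names.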
Results
ChatGPT O1 achieved a diagnostic accuracy of 92.8%, significantly outperforming DeepSeek-R1, which scored 87.0% (p < 0.00001). The CoT reasoning technique used by ChatGPT O1 allowed for more structured and reliable responses, reducing the risk of errors. Conversely, DeepSeek-R1, while slightly less accurate, demonstrated superior accessibility and adaptability due to its open-source nature and emerging self-reflection capabilities. Cohen’s Kappa (κ = 0.20) indicated low agreement between the models, reflecting their distinct reasoning strategies.
Conclusions
This study highlights the strengths of ChatGPT O1 in providing accurate and coherent clinical reasoning, making it highly suitable for critical pediatric scenarios. DeepSeek-R1, with its flexibility and accessibility, remains a valuable tool in resource-limited settings. Combining these models in an ensemble system could leverage their complementary strengths, optimizing decision support in diverse clinical contexts. Further research is warranted to explore their integration into multidisciplinary care teams and their application in real-world clinical settings.