Evaluating AI Reasoning Models in Pediatric Medicine: A Comparative Analysis of o3-mini and o3-mini-high
Abstract
Artificial intelligence (AI) is playing an increasingly important role in modern medicine, particularly in clinical decision support. This study compares the performance of two OpenAI reasoning models, o3-mini and o3-mini-high, in answering 900 pediatric clinical questions derived from the MedQA-USMLE dataset. The evaluation focuses on accuracy, response time, and consistency to determine their effectiveness in pediatric diagnostic and therapeutic decision-making. The results indicate that o3-mini-high achieves higher accuracy (90.55% vs. 88.3%) and faster response times (64.63 seconds vs. 71.63 seconds) than o3-mini. A chi-square test confirmed that these differences are statistically significant (χ² = 328.9675, p < 0.00001). Error analysis revealed that o3-mini-high corrected more of o3-mini's errors than vice versa, but both models shared 61 common errors, suggesting intrinsic limitations in the training data or model architecture. Additionally, accessibility differences between the models were considered: while DeepSeek-R1, evaluated in a previous study, offers unrestricted free access, OpenAI's o3 models impose message limits, potentially affecting their suitability in resource-constrained environments. Future improvements should aim at reducing shared errors, optimizing o3-mini's accuracy while maintaining its efficiency, and refining o3-mini-high for enhanced performance. An ensemble approach that leverages both models' strengths could provide a more robust AI-driven clinical decision support system, particularly in time-sensitive pediatric settings such as emergency care and neonatal intensive care units.
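The chi-square comparison described above can be sketched from a 2×2 contingency table of correct versus incorrect answers per model. The counts below are hypothetical illustrations derived from the reported accuracies (90.55% and 88.3% of 900 questions, rounded); they are not the study's raw data, and the resulting statistic will not necessarily match the reported value.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]], with the p-value for 1 degree of freedom
    computed via the complementary error function."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # For df = 1, the chi-square survival function is erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Hypothetical correct/incorrect counts out of 900 questions each:
# 815 ~ 90.55% (o3-mini-high), 795 ~ 88.3% (o3-mini)
correct_high, wrong_high = 815, 85
correct_mini, wrong_mini = 795, 105

chi2, p = chi2_2x2(correct_high, wrong_high, correct_mini, wrong_mini)
print(f"chi2 = {chi2:.4f}, p = {p:.4f}")
```

With a larger dataset or different error counts, the same function applies unchanged; `scipy.stats.chi2_contingency` would give an equivalent result (with an optional continuity correction for 2×2 tables).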