Comparative Assessment of the Accuracy of Different Artificial Intelligence Models in Answering Analytical and Knowledge-Based Questions in Oral and Maxillofacial Radiology and Oral and Maxillofacial Surgery: A Research Article
Abstract
Background: Artificial intelligence (AI) models are increasingly used in healthcare education; however, their ability to handle both factual knowledge and analytical clinical reasoning in dentistry remains unclear. This study aimed to compare the accuracy of different AI models in answering knowledge-based and analytical multiple-choice questions in Oral and Maxillofacial Radiology (OMFR) and Oral and Maxillofacial Surgery (OMFS), and to evaluate performance differences according to cognitive task type.

Methods: This cross-sectional comparative study analyzed 258 multiple-choice questions from the Turkish Dental Specialty Examination (DUS) administered between 2012 and 2021 (202 knowledge-based, 56 analytical). Five AI models (ChatGPT-5.2 Go, ChatGPT-5.2 Plus, DeepSeek V3, Claude Sonnet 4.5, and Gemini 3 Flash) answered all questions under default settings in a single session. Accuracy rates were compared using chi-square and Kruskal–Wallis tests with Bonferroni correction. Inter-model agreement and reliability were assessed using Cohen's kappa and the intraclass correlation coefficient (ICC) (α = 0.05).

Results: Significant differences among models were observed in knowledge-based questions (p = 0.048), analytical questions (p = 0.032), and overall accuracy (p = 0.006). Gemini achieved the highest accuracy in knowledge-based questions, while Claude demonstrated the lowest performance. Although a general difference was detected in analytical questions, pairwise comparisons did not show clear model superiority. Overall performance largely reflected success in knowledge-based tasks. Agreement analysis showed low kappa values (κ = 0.226–0.339) but moderate ICC levels (0.597–0.728).

Conclusions: AI models demonstrate strong factual recall but remain limited in analytical clinical reasoning tasks. While these models may serve as supportive tools in dental education, their use as independent clinical decision-making systems is not yet reliable.
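The accuracy and agreement measures named in the Methods can be illustrated with a minimal sketch. The answer strings, model names, and question count below are hypothetical stand-ins, not the study's data; Cohen's kappa is computed from first principles for two raters over categorical labels.

```python
# Illustrative sketch only: scoring multiple-choice answers against a key and
# measuring pairwise inter-model agreement with Cohen's kappa.
# All data below is invented for demonstration.

def accuracy(answers, key):
    """Fraction of questions answered correctly."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def cohens_kappa(r1, r2):
    """Chance-corrected agreement between two raters on categorical labels."""
    n = len(r1)
    labels = set(r1) | set(r2)
    p_observed = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement if the two raters' label frequencies were independent
    p_expected = sum((r1.count(l) / n) * (r2.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical A-E answer keys for 10 questions (the study used 258)
key     = list("ABCDEABCDE")
model_1 = list("ABCDAABCDE")   # 9/10 correct
model_2 = list("ABCDEBBCDA")   # 8/10 correct

print(accuracy(model_1, key))                       # 0.9
print(accuracy(model_2, key))                       # 0.8
print(round(cohens_kappa(model_1, model_2), 3))     # 0.62
```

A kappa well below the raw observed agreement, as in the study's reported range (κ = 0.226–0.339), indicates that much of the models' overlap is attributable to chance rather than consistent shared behavior.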