Comparative Assessment of the Accuracy of Different Artificial Intelligence Models in Answering Analytical and Knowledge-Based Questions in Oral and Maxillofacial Radiology and Oral and Maxillofacial Surgery: A Research Article
Abstract
Background: Artificial intelligence (AI) models are increasingly used in healthcare education; however, their ability to handle both factual knowledge and analytical clinical reasoning in dentistry remains unclear. This study aimed to compare the accuracy of different AI models in answering knowledge-based and analytical multiple-choice questions in Oral and Maxillofacial Radiology (OMFR) and Oral and Maxillofacial Surgery (OMFS), and to evaluate performance differences according to cognitive task type.

Methods: This cross-sectional comparative study analyzed 258 multiple-choice questions from the Turkish Dental Specialty Examination (DUS) administered between 2012 and 2021 (202 knowledge-based, 56 analytical). Five AI models (ChatGPT-5.2 Go, ChatGPT-5.2 Plus, DeepSeek V3, Claude Sonnet 4.5, and Gemini 3 Flash) answered all questions under default settings in a single session. Accuracy rates were compared using chi-square and Kruskal–Wallis tests with Bonferroni correction. Inter-model agreement and reliability were assessed using Cohen's kappa and the intraclass correlation coefficient (ICC) (α = 0.05).

Results: Significant differences among models were observed in knowledge-based questions (p = 0.048), analytical questions (p = 0.032), and overall accuracy (p = 0.006). Gemini achieved the highest accuracy in knowledge-based questions, while Claude demonstrated the lowest performance. Although a general difference was detected in analytical questions, pairwise comparisons did not show clear model superiority. Overall performance largely reflected success in knowledge-based tasks. Agreement analysis showed low kappa values (κ = 0.226–0.339) but moderate ICC levels (0.597–0.728).

Conclusions: AI models demonstrate strong factual recall but remain limited in analytical clinical reasoning tasks. While these models may serve as supportive tools in dental education, their use as independent clinical decision-making systems is not yet reliable.
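The accuracy and agreement measures named in the Methods can be illustrated with a minimal sketch. The answer strings, model names, and question count below are hypothetical stand-ins, not the study's data; Cohen's kappa is computed from first principles for two raters over categorical labels.

```python
# Illustrative sketch only: scoring multiple-choice answers against a key and
# measuring pairwise inter-model agreement with Cohen's kappa.
# All data below is invented for demonstration.

def accuracy(answers, key):
    """Fraction of questions answered correctly."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def cohens_kappa(r1, r2):
    """Chance-corrected agreement between two raters on categorical labels."""
    n = len(r1)
    labels = set(r1) | set(r2)
    p_observed = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement if the two raters' label frequencies were independent
    p_expected = sum((r1.count(l) / n) * (r2.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical A-E answer keys for 10 questions (the study used 258)
key     = list("ABCDEABCDE")
model_1 = list("ABCDAABCDE")   # 9/10 correct
model_2 = list("ABCDEBBCDA")   # 8/10 correct

print(accuracy(model_1, key))                       # 0.9
print(accuracy(model_2, key))                       # 0.8
print(round(cohens_kappa(model_1, model_2), 3))     # 0.62
```

A kappa well below the raw observed agreement, as in the study's reported range (κ = 0.226–0.339), indicates that much of the models' overlap is attributable to chance rather than consistent shared behavior.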