Accuracy of Large Language Models on Multiple-Choice Questions in Oral and Dentomaxillofacial Radiology: A Comparison Based on the Revised Bloom’s Taxonomy

Abstract

Background: Large language models have recently attracted attention for their potential applications in health sciences education and clinical decision support. This study evaluated the performance of several large language models (ChatGPT-4 Omni, Gemini 2.0 Flash, Microsoft Copilot, Mistral Large 2, and DeepSeek V3) in answering multiple-choice questions in oral and dentomaxillofacial radiology according to the revised Bloom’s taxonomy.

Methods: Oral and dentomaxillofacial radiology questions from the Dental Specialty Examination were classified by cognitive level according to the revised Bloom’s taxonomy and individually submitted to five large language models in independent sessions under default settings. Responses were scored against the official answer key, and performance was analyzed using descriptive statistics and Cronbach’s alpha reliability analysis.

Results: Across all models, accuracy was highest at the remembering and understanding levels and declined at higher cognitive levels. Accuracy was lowest at the applying level and moderate at the analyzing and evaluating levels, indicating limited performance on items requiring clinical reasoning. Cronbach’s alpha coefficients indicated acceptable to good internal consistency across most cognitive levels.

Conclusions: These findings highlight the varying capabilities of large language models across cognitive domains and emphasize the need for further investigation into their use in more complex assessment formats and interactive learning environments.