Accuracy of Large Language Models on Multiple-Choice Questions in Oral and Dentomaxillofacial Radiology: A Comparison Based on the Revised Bloom’s Taxonomy

Abstract

Background: Large language models have recently attracted attention for their potential applications in health sciences education and clinical decision support. This study evaluated the performance of several large language models (ChatGPT-4 Omni, Gemini 2.0 Flash, Microsoft Copilot, Mistral Large 2, and DeepSeek V3) in answering multiple-choice questions in oral and dentomaxillofacial radiology according to the revised Bloom’s taxonomy.

Methods: Oral and dentomaxillofacial radiology questions from the Dental Specialty Examination were classified by cognitive level according to the revised Bloom’s taxonomy and individually submitted to five large language models in independent sessions under default settings. Responses were scored against the official answer key, and performance was analyzed using descriptive statistics and Cronbach’s alpha reliability analysis.

Results: Across all models, accuracy was highest at the remembering and understanding levels and declined at higher cognitive levels. Accuracy was lowest at the applying level and moderate at the analyzing and evaluating levels, indicating limited performance on items requiring clinical reasoning. Cronbach’s alpha coefficients indicated acceptable to good internal consistency across most cognitive levels.

Conclusions: These findings highlight the varying capabilities of large language models across cognitive domains and emphasize the need for further investigation into their use in more complex assessment formats and interactive learning environments.