Accuracy of Large Language Models in the Dental Specialization Examination: A Multidimensional Analysis
Abstract
Background
This study evaluated the performance of contemporary large language models (LLMs) on the clinical sciences component of the Turkish Dental Specialization Examination (DUS), comparing their accuracy across disciplines, examination years, and question formats.

Methods
A total of 1,427 clinical sciences questions from official DUS examinations were analyzed. All items were evaluated in Turkish, classified into eight clinical dental disciplines, and categorized by format as MCQ, CMCQ, or image-based IBMCQ items. Each question was independently submitted to seven contemporary LLMs, and responses were scored against the official answer keys. Statistical comparisons were performed with chi-square tests and Bonferroni-adjusted multiple comparisons (significance threshold p < 0.05).

Results
Accuracy differed significantly among the seven LLMs (χ² = 729.63; p < 0.001). Gemini 2.5 Pro achieved the highest accuracy (93.4%), whereas Qwen 3.0 MA showed the lowest (57.4%). Across disciplines, Oral and Maxillofacial Surgery and Periodontology yielded the highest accuracies, while Prosthodontics consistently showed the lowest performance (all p < 0.001). Inter-model accuracy differed significantly for MCQ and CMCQ formats (both p < 0.001) but declined sharply on image-based IBMCQ items (32.6%–65.2%), where inter-model differences were not significant (p = 0.087). Year-based analyses indicated significant inter-model variation (p < 0.05), with DeepSeek-V3.2 showing the greatest temporal stability.

Conclusions
This study demonstrates clear performance differences among contemporary AI models on dental specialty examinations. Gemini 2.5 Pro showed the highest overall and year-to-year accuracy, whereas Qwen 3.0 MA consistently performed the lowest; DeepSeek-V3.2 was the most stable model over time. While Gemini 2.5 Pro excelled in MCQ and CMCQ formats, all models exhibited marked accuracy loss on image-based IBMCQ items. The highest domain-specific performance occurred in Oral and Maxillofacial Surgery and Periodontology, whereas Prosthodontics remained the most challenging discipline.
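The statistical procedure described in Methods (an omnibus chi-square test across models followed by Bonferroni-adjusted pairwise comparisons) can be sketched as follows. This is a minimal illustration using SciPy; the model names and correct/incorrect counts below are hypothetical and are not the study's data.

```python
# Sketch of the analysis in Methods: omnibus chi-square across models,
# then pairwise chi-square tests with a Bonferroni-adjusted alpha.
# All counts here are illustrative placeholders, not study results.
from itertools import combinations
from scipy.stats import chi2_contingency

# Hypothetical (correct, incorrect) counts per model out of 1,427 items
counts = {
    "Model A": (1333, 94),
    "Model B": (1100, 327),
    "Model C": (819, 608),
}

# Omnibus test: does accuracy differ across all models?
table = [list(v) for v in counts.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"omnibus: chi2={chi2:.2f}, dof={dof}, p={p:.3g}")

# Pairwise comparisons with Bonferroni correction
pairs = list(combinations(counts, 2))
alpha = 0.05 / len(pairs)  # adjusted significance threshold
for a, b in pairs:
    _, p_ab, _, _ = chi2_contingency([counts[a], counts[b]])
    verdict = "significant" if p_ab < alpha else "n.s."
    print(f"{a} vs {b}: p={p_ab:.3g} ({verdict})")
```

The same two-step pattern (one omnibus test, then alpha divided by the number of pairwise comparisons) applies to the per-discipline, per-format, and per-year analyses reported in Results.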