Accuracy of Large Language Models in the Dental Specialization Examination: A Multidimensional Analysis
Abstract
Background
This study evaluated the performance of contemporary large language models (LLMs) on the clinical sciences component of the Turkish Dental Specialization Examination (DUS), comparing their accuracy across disciplines, examination years, and question formats.

Methods
A total of 1,427 clinical sciences questions from official DUS examinations were analyzed. All items were evaluated in Turkish, classified into eight clinical dental disciplines, and categorized by format as MCQ, CMCQ, or image-based IBMCQ items. Each question was independently submitted to seven contemporary LLMs, and responses were scored against the official answer keys. Statistical comparisons were performed with chi-square tests and Bonferroni-adjusted multiple comparisons (significance threshold p < 0.05).

Results
Accuracy differed significantly among the seven LLMs (χ² = 729.63; p < 0.001). Gemini 2.5 Pro achieved the highest accuracy (93.4%), whereas Qwen 3.0 MA showed the lowest (57.4%). Across disciplines, Oral and Maxillofacial Surgery and Periodontology yielded the highest accuracies, while Prosthodontics consistently showed the lowest performance (all p < 0.001). Inter-model accuracy differed significantly for MCQ and CMCQ formats (both p < 0.001) but declined sharply on image-based IBMCQ items (32.6%–65.2%), where inter-model differences were not significant (p = 0.087). Year-based analyses indicated significant inter-model variation (p < 0.05), with DeepSeek-V3.2 showing the greatest temporal stability.

Conclusions
This study demonstrates clear performance differences among contemporary AI models on dental specialty examinations. Gemini 2.5 Pro showed the highest overall and year-to-year accuracy, whereas Qwen 3.0 MA consistently performed the lowest; DeepSeek-V3.2 was the most stable model over time. While Gemini 2.5 Pro excelled in MCQ and CMCQ formats, all models exhibited marked accuracy loss on image-based IBMCQ items. The highest domain-specific performance occurred in Oral and Maxillofacial Surgery and Periodontology, whereas Prosthodontics remained the most challenging discipline.
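The statistical procedure described in Methods (an omnibus chi-square test across models followed by Bonferroni-adjusted pairwise comparisons) can be sketched as follows. This is a minimal illustration using SciPy; the model names and correct/incorrect counts below are hypothetical and are not the study's data.

```python
# Sketch of the analysis in Methods: omnibus chi-square across models,
# then pairwise chi-square tests with a Bonferroni-adjusted alpha.
# All counts here are illustrative placeholders, not study results.
from itertools import combinations
from scipy.stats import chi2_contingency

# Hypothetical (correct, incorrect) counts per model out of 1,427 items
counts = {
    "Model A": (1333, 94),
    "Model B": (1100, 327),
    "Model C": (819, 608),
}

# Omnibus test: does accuracy differ across all models?
table = [list(v) for v in counts.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"omnibus: chi2={chi2:.2f}, dof={dof}, p={p:.3g}")

# Pairwise comparisons with Bonferroni correction
pairs = list(combinations(counts, 2))
alpha = 0.05 / len(pairs)  # adjusted significance threshold
for a, b in pairs:
    _, p_ab, _, _ = chi2_contingency([counts[a], counts[b]])
    verdict = "significant" if p_ab < alpha else "n.s."
    print(f"{a} vs {b}: p={p_ab:.3g} ({verdict})")
```

The same two-step pattern (one omnibus test, then alpha divided by the number of pairwise comparisons) applies to the per-discipline, per-format, and per-year analyses reported in Results.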