How Accurate and Consistent Are Large Language Models in Restorative Dentistry Questions? A Cross-Sectional Test-Retest Study
Abstract
Background: In recent years, large language models (LLMs) have emerged as a notable breakthrough in artificial intelligence. The aim of this study was to compare the accuracy of different LLMs on multiple-choice questions (MCQs) in restorative dentistry from the dental specialisation exam (DUS) administered in Turkey, and to evaluate response consistency (test–retest reliability) between two sessions.

Methods: A total of 127 text-based restorative dentistry MCQs from the DUS, none requiring visual material, were used. Responses from the ChatGPT-5.1, Gemini 2.5 Pro, Microsoft Copilot, and DeepSeek-v3.2 models were collected at two time points (T1 and T2) and coded as correct or incorrect against the official answer key. Differences in accuracy between models were analysed with Cochran's Q test, and within-model change between sessions with the McNemar test. Test–retest reliability was assessed using Cohen's Kappa coefficient and percentage agreement rates.

Results: ChatGPT-5.1 achieved the highest accuracy rate in both sessions, while DeepSeek-v3.2 showed relatively lower accuracy. However, no statistically significant difference was found between the models' T1 and T2 accuracy rates, and no significant performance differences emerged between the models in the subcategory analyses. Despite the high accuracy rates, test–retest analyses showed that response stability varied by model, with Cohen's Kappa values ranging from low to moderate.

Conclusions: LLMs appear capable of answering theoretical questions in restorative dentistry with high accuracy, but may show limitations in time-dependent response consistency. These findings suggest that while LLMs have potential as supportive tools in dental education, their use without human oversight requires careful consideration.
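The test–retest reliability measures named in the Methods can be illustrated with a minimal sketch. The codings below are hypothetical (invented for illustration, not the study's data): each question is scored 1 (correct) or 0 (incorrect) for one model at sessions T1 and T2, and percentage agreement and Cohen's Kappa are computed from those two binary vectors.

```python
def percent_agreement(t1, t2):
    """Share of questions answered identically in both sessions
    (both correct or both incorrect)."""
    return sum(a == b for a, b in zip(t1, t2)) / len(t1)

def cohens_kappa(t1, t2):
    """Cohen's Kappa for two binary codings: observed agreement
    corrected for the agreement expected by chance."""
    n = len(t1)
    po = percent_agreement(t1, t2)          # observed agreement
    p1_t1 = sum(t1) / n                      # P(correct) at T1
    p1_t2 = sum(t2) / n                      # P(correct) at T2
    # expected agreement if the two sessions were independent
    pe = p1_t1 * p1_t2 + (1 - p1_t1) * (1 - p1_t2)
    return (po - pe) / (1 - pe)

# Hypothetical codings for 10 of the 127 questions (1 = correct).
t1 = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
t2 = [1, 1, 0, 0, 1, 1, 1, 1, 0, 1]

print(round(percent_agreement(t1, t2), 2))  # → 0.8
print(round(cohens_kappa(t1, t2), 2))       # → 0.52
```

Note how a model can agree with itself on 80% of items yet reach only a moderate Kappa: when accuracy is high, much of the raw agreement is expected by chance, which is consistent with the abstract's finding of high accuracy alongside low-to-moderate Kappa values.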