Comparison of the ChatGPT and DeepSeek models in responding to multiple choice questions related to rehabilitation of completely edentulous patients with complete dentures
Abstract
Background. Artificial intelligence (AI) chatbots are considered a potential resource for dental education. However, the trustworthiness, validity, and utility of the content on these platforms remain inadequately characterized, and their application in dental education is still understudied.

Methods. A set of 100 multiple-choice questions (MCQs) related to removable complete denture prosthodontics was formulated. The queries were submitted to two AI models, DeepSeek-V3 and ChatGPT-4o. First, the accuracy of the generated answers was assessed. Two reviewers then rated the usefulness and reliability of the responses using modified 5-point Likert scales. The McNemar test was used to compare answer accuracy between ChatGPT-4o and DeepSeek-V3, the Wilcoxon signed-rank test to compare reliability and usefulness scores between the two AI tools, and the chi-square test to assess the association between question type and answer accuracy (α = .05).

Results. Response accuracy was 59% for ChatGPT-4o and 66% for DeepSeek-V3; this difference was not statistically significant (P = 0.281). Reliability scores differed significantly between the two tools (P = 0.027), with DeepSeek-V3 achieving a statistically significantly higher overall reliability score. Usefulness scores also differed significantly between the two AI tools (P < 0.001). ChatGPT-4o responses were inaccurate significantly more often for analytical questions than for knowledge-based questions (P = 0.047).

Conclusions. There was no significant difference in answer accuracy between ChatGPT-4o and DeepSeek-V3. DeepSeek-V3 generated responses with significantly higher reliability and usefulness than ChatGPT-4o. ChatGPT-4o responses showed greater inaccuracy for analytical questions than for knowledge-based questions.
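The three tests named in the Methods can be illustrated with a short Python sketch. This is not the authors' script or data: the correctness vectors, Likert ratings, and contingency counts below are hypothetical, and the exact McNemar test is computed via a binomial test on the discordant pairs, which is equivalent for paired binary outcomes.

```python
# Illustrative sketch of the abstract's statistical protocol on hypothetical
# data: McNemar (paired accuracy), Wilcoxon signed-rank (paired Likert
# ratings), and chi-square (question type vs. accuracy).
import numpy as np
from scipy.stats import binomtest, wilcoxon, chi2_contingency

rng = np.random.default_rng(0)
n = 100  # number of MCQs, as in the study

# Hypothetical per-question correctness (True = correct) for each model,
# drawn to roughly match the reported 59% and 66% accuracies.
gpt_correct = rng.random(n) < 0.59
ds_correct = rng.random(n) < 0.66

# Exact McNemar test: binomial test on the discordant pairs only.
b = int(np.sum(gpt_correct & ~ds_correct))  # GPT right, DeepSeek wrong
c = int(np.sum(~gpt_correct & ds_correct))  # GPT wrong, DeepSeek right
p_mcnemar = binomtest(b, b + c, 0.5).pvalue

# Wilcoxon signed-rank test on hypothetical paired 5-point Likert
# reliability ratings for the same 100 responses.
gpt_rel = rng.integers(1, 6, n)
ds_rel = rng.integers(1, 6, n)
p_wilcoxon = wilcoxon(gpt_rel, ds_rel).pvalue

# Chi-square test of independence: hypothetical 2x2 table of
# correct/incorrect counts by question type (knowledge vs. analytical).
ct = [[30, 29],   # correct:   knowledge, analytical
      [11, 30]]   # incorrect: knowledge, analytical
chi2, p_chi2, dof, expected = chi2_contingency(ct)

print(f"McNemar P = {p_mcnemar:.3f}")
print(f"Wilcoxon P = {p_wilcoxon:.3f}")
print(f"Chi-square P = {p_chi2:.3f}")
```

The McNemar and Wilcoxon tests are appropriate here because both models answered the same 100 questions, making the observations paired rather than independent.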