A Comparative Analysis of Five AI Chatbots in Providing Patient Education on Smile Design
Abstract
Background: This study aimed to evaluate and compare the accuracy, quality, readability, understandability, and actionability of responses provided by five AI chatbots (Microsoft Copilot, ChatGPT-4, ChatGPT-5, Google Gemini, and Claude Sonnet 4.5) to patient questions about smile design and anterior aesthetic dental procedures.

Methods: Twenty-eight patient-oriented questions were collected from Reddit and Quora. A volunteer posed these questions to the five AI chatbots on the same day in a blinded order. Each response was recorded and coded to maintain anonymity. Two prosthodontists independently assessed the responses for accuracy using a 5-point Likert scale, quality using the Global Quality Scale (GQS), and understandability and actionability using the Patient Education Materials Assessment Tool (PEMAT-P). Readability was measured with the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) indices. Inter-rater reliability was calculated using Cohen's kappa. Statistical analyses were performed using Kruskal-Wallis tests for non-parametric data and ANOVA for normally distributed readability scores, with p < 0.05 considered statistically significant.

Results: Significant differences were observed in accuracy (p = 0.013) and quality (p < 0.001) among the chatbots. ChatGPT-5 had lower accuracy than Google Gemini (p = 0.017) and Claude Sonnet 4.5 (p = 0.041) and lower quality than all other chatbots (p < 0.001). Readability differed significantly (FRE: p = 0.004; FKGL: p < 0.001), with ChatGPT-5 responses requiring the highest reading level. PEMAT-P scores also showed significant differences in understandability and actionability (p < 0.001), with ChatGPT-5 scoring lower than the other chatbots. Microsoft Copilot, ChatGPT-4, and Google Gemini generally provided higher-quality, more understandable, and more actionable information, while ChatGPT-5 and Claude Sonnet 4.5 showed limitations. Most chatbot responses were above an eighth-grade reading level, which may challenge general patient comprehension.

Conclusion: AI chatbots vary considerably in the quality and usefulness of the information they provide for complex dental procedures such as smile design. While some models deliver accurate and comprehensible responses, others may produce lower-quality, less actionable content. Despite high understandability in most responses, high reading levels and low actionability could limit patient comprehension and effective decision-making. Caution is warranted when patients rely on AI chatbots for dental education, and further improvements are needed to enhance reliability, readability, and actionable guidance.
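
For readers unfamiliar with the readability indices cited in the Methods, the standard Flesch formulas (assumed here; the abstract does not state how the scores were computed) combine average sentence length and average syllables per word:

FRE = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)
FKGL = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

Higher FRE values indicate easier text, while FKGL approximates the U.S. school grade needed to understand it; health-literacy guidance commonly recommends patient materials at roughly a sixth- to eighth-grade level, which is the benchmark referenced in the Results.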