A Comparative Analysis of Five AI Chatbots in Providing Patient Education on Smile Design
Abstract
Background: This study aimed to evaluate and compare the accuracy, quality, readability, understandability, and actionability of responses provided by five AI chatbots (Microsoft Copilot, ChatGPT-4, ChatGPT-5, Google Gemini, and Claude Sonnet 4.5) to patient questions about smile design and anterior aesthetic dental procedures.

Methods: Twenty-eight patient-oriented questions were collected from Reddit and Quora. A volunteer posed these questions to the five AI chatbots on the same day in a blinded order. Each response was recorded and coded to maintain anonymity. Two prosthodontists independently assessed the responses for accuracy using a 5-point Likert scale, quality using the Global Quality Scale (GQS), and understandability and actionability using the Patient Education Materials Assessment Tool (PEMAT-P). Readability was measured with the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) indices. Inter-rater reliability was calculated using Cohen's kappa. Statistical analyses were performed using Kruskal-Wallis tests for non-parametric data and ANOVA for normally distributed readability scores, with p < 0.05 considered statistically significant.

Results: Significant differences were observed in accuracy (p = 0.013) and quality (p < 0.001) among the chatbots. ChatGPT-5 had lower accuracy than Google Gemini (p = 0.017) and Claude Sonnet 4.5 (p = 0.041) and lower quality than all other chatbots (p < 0.001). Readability differed significantly (FRE: p = 0.004; FKGL: p < 0.001), with ChatGPT-5 responses requiring the highest reading level. PEMAT-P scores also showed significant differences in understandability and actionability (p < 0.001), with ChatGPT-5 scoring lower than the other chatbots. Microsoft Copilot, ChatGPT-4, and Google Gemini generally provided higher-quality, more understandable, and more actionable information, while ChatGPT-5 and Claude Sonnet 4.5 showed limitations. Most chatbot responses were above an eighth-grade reading level, which may challenge general patient comprehension.

Conclusion: AI chatbots vary considerably in the quality and usefulness of the information they provide for complex dental procedures such as smile design. While some models deliver accurate and comprehensible responses, others may produce lower-quality, less actionable content. Despite high understandability in most responses, high reading levels and low actionability could limit patient comprehension and effective decision-making. Caution is warranted when patients rely on AI chatbots for dental education, and further improvements are needed to enhance reliability, readability, and actionable guidance.
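
For readers unfamiliar with the readability indices cited in the Methods, the standard Flesch formulas (assumed here; the abstract does not state how the scores were computed) combine average sentence length and average syllables per word:

FRE = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)
FKGL = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

Higher FRE values indicate easier text, while FKGL approximates the U.S. school grade needed to understand it; health-literacy guidance commonly recommends patient materials at roughly a sixth- to eighth-grade level, which is the benchmark referenced in the Results.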