Validity and reliability of AI chatbots on the comparative diagnosis and definitive management of deep caries based on position statements evaluated from post-graduate students and clinicians' perspectives


Abstract

Aim: This study evaluated the validity and reliability of four prominent AI chatbots (ChatGPT, Perplexity, Claude, and Gemini) in the comparative diagnosis and definitive management of deep caries, guided by global position statements from endodontic organizations and assessed by post-graduate students and clinicians.

Methods: Four AI chatbots (ChatGPT, Perplexity, Claude, and Gemini) were accessed through their respective APIs using the pro versions. Ten short case histories representing a spectrum of deep-caries scenarios, along with corresponding position statements from the European Society of Endodontology, the American Association of Endodontists, the Indian Endodontic Society, and others, were provided to each chatbot. The chatbots were prompted to generate diagnostic and management responses, repeated three times per case per chatbot. Responses were evaluated by two post-graduate students and three senior clinicians using a 5-point Likert scale and an adapted Global Quality Score (GQS) for validity, with Cronbach's alpha used to assess reliability. Statistical analysis included low- and high-threshold validity tests and intergroup reliability comparisons.

Conclusion: Perplexity exhibited the highest reliability and validity in deep-caries diagnosis and management compared with ChatGPT, Claude, and Gemini. While Perplexity, Claude, and Gemini demonstrated perfect or near-perfect validity under low-threshold criteria, only Perplexity maintained moderate validity at high-stringency levels. Overall variability and limited descriptive depth across all chatbot outputs highlight current limitations for clinical implementation. AI chatbots may serve as useful educational or adjunctive tools, but they cannot substitute for professional judgment in endodontic diagnosis and treatment. Future development should focus on improving performance mechanisms and regulatory oversight to support clinical accuracy and reliability.
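The abstract reports Cronbach's alpha as the reliability measure for the raters' scores. As an illustration only (the data below are hypothetical, not the study's), a minimal sketch of how alpha can be computed from a cases-by-raters score matrix:

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for internal consistency.

    ratings: 2D array-like, rows = cases, columns = raters
    (here, raters play the role of "items").
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                          # number of raters
    rater_vars = ratings.var(axis=0, ddof=1)      # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of per-case summed scores
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Hypothetical 5-point Likert scores: 5 cases rated by 3 evaluators.
scores = [[1, 2, 1],
          [2, 2, 3],
          [3, 4, 3],
          [4, 4, 5],
          [5, 5, 4]]
print(cronbach_alpha(scores))  # approaches 1.0 as raters agree more closely
```

Alpha equals 1.0 when all raters assign identical scores to every case and decreases as their disagreement grows; values above roughly 0.7 to 0.8 are conventionally read as acceptable reliability.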
