DentaCoPilot: An LLM-Augmented Next-Procedure Recommender for General Dentistry, Designed for Dentist Augmentation

Carson CONCEPTION Rodrigues
Steffie Dione Rebello

Abstract

Commercial dental artificial intelligence in 2026 is overwhelmingly diagnostic, focusing on caries, calculus, periapical, and bone-level detection on radiographs. The clinically harder question that follows every diagnosis (given a patient's chart and most recent procedure, what should the dentist do next?) remains unsolved at the scale of general dentistry. The closest published system, MultiTP (Chen et al., 2024), is a CNN-RNN restricted to partial-edentulism cases; it provides no calibrated uncertainty, no structured rationale, and no evaluation that treats the model as decision support rather than as an autonomous classifier. We introduce DentaCoPilot, a recommender that, given a structured chart, returns (i) a calibrated top-K probability distribution over Current Dental Terminology (CDT) codes for the next procedure, (ii) a verbalized confidence label, (iii) an explicit abstain flag when context is insufficient, and (iv) a chart-grounded rationale. We compare four classical baselines (frequency bigram, TF-IDF + logistic regression, XGBoost, and a MultiTP-style CNN-RNN) and six large language model (LLM) variants (Claude Haiku, Sonnet + chain-of-thought, Sonnet + retrieval, Opus + chain-of-thought, Sonnet + classical prior, and Opus + classical prior) on a synthetic chart corpus of 500 patients (1,284 test examples). We further establish real-world external validity using the public MEPS 2023 Dental Visits corpus, comprising 11,016 next-visit transitions across 5,088 patients at treatment-category granularity. All LLM inference is routed through the local Anthropic Claude Code CLI, and every call is logged for auditability. In a like-for-like evaluation, classical baselines achieve 0.567 top-1 and 0.967 top-5 accuracy, whereas pure LLM variants achieve 0.267-0.467 top-1 accuracy. Prompt-conditioning a Sonnet LLM on the classical baseline's top-10 candidates (M5) substantially closes the gap: top-5 accuracy increases from 0.733 (pure Sonnet + chain-of-thought) to 0.933, approaching classical baseline performance on the synthetic dataset while preserving rationale generation and abstention capabilities. Because the synthetic generator shares Markov structure with the bigram baseline, these synthetic accuracy rankings should be interpreted as indicators of pipeline behavior rather than as externally valid clinical performance. Upgrading the LLM backbone from Sonnet to Opus does not improve accuracy, with or without priming. Calibration through temperature scaling and coverage-risk analysis is also reported for the baselines. On the real-world MEPS corpus, history-aware baselines achieve 0.479 top-1 and 0.781 top-3 accuracy (macro-F1 = 0.42), compared with 0.297 top-1 accuracy for a most-frequent baseline, confirming that procedure history is predictive of subsequent treatment categories. At the category granularity available in MEPS, a zero-shot LLM given only coarse procedure history underperforms the Markov baseline (0.213 vs. 0.533 top-1 accuracy; n = 150), indicating that the value of LLMs depends on access to richer chart information rather than category history alone. Prompt-conditioning a small LLM on a classical baseline's top-K predictions emerges as the most cost-effective LLM design evaluated for next-procedure recommendation, while preserving the augmentation features that distinguish the system from an autonomous classifier. A pre-registered clinician-in-the-loop evaluation at the KLE Vishwanath Katti Institute of Dental Sciences (Belgaum, India), together with CDT-level multi-institutional validation under institutional data-use agreements, constitutes the next stage of this work.
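To make the simplest baseline and the output contract concrete, the following is a minimal sketch of a frequency bigram recommender that returns a top-K distribution with an abstain flag, plus a top-K accuracy helper. It is an illustration only and not the authors' implementation: the function names, the `min_confidence` threshold, and the toy CDT sequences are all assumptions, and the probabilities are raw relative frequencies rather than the temperature-scaled, calibrated values the abstract describes.

```python
from collections import Counter, defaultdict

def fit_bigram(sequences):
    """Count how often each code follows each preceding code."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def recommend(counts, last_code, k=3, min_confidence=0.4):
    """Return a top-k distribution and an abstain flag (hypothetical rule)."""
    seen = counts.get(last_code)
    if not seen:
        # No history for this code: insufficient context, so abstain.
        return {"top_k": [], "abstain": True}
    total = sum(seen.values())
    top_k = [(code, n / total) for code, n in seen.most_common(k)]
    # Abstain when even the best candidate is below the threshold.
    return {"top_k": top_k, "abstain": top_k[0][1] < min_confidence}

def top_k_accuracy(counts, examples, k):
    """Fraction of (last_code, true_next) pairs whose true code is in the top k."""
    hits = 0
    for last_code, true_next in examples:
        predicted = [c for c, _ in recommend(counts, last_code, k=k)["top_k"]]
        hits += true_next in predicted
    return hits / len(examples)

# Toy procedure histories (illustrative only, not data from the paper).
train = [
    ["D0120", "D1110", "D0274"],
    ["D0120", "D1110", "D2391"],
    ["D0120", "D0274", "D2391"],
    ["D3330", "D2740"],
]
test = [
    ("D0120", "D1110"),
    ("D0120", "D0274"),
    ("D3330", "D2740"),
    ("D2740", "D0120"),  # last code never seen as a predecessor in training
]

model = fit_bigram(train)
print(recommend(model, "D0120"))
print(recommend(model, "D9999"))
print(top_k_accuracy(model, test, k=1), top_k_accuracy(model, test, k=3))
```

In this sketch an unseen preceding code yields an empty candidate list and `abstain=True`, which also counts as a miss in the accuracy helper; a real system would have to decide how abstentions enter its coverage-risk analysis.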

Version published to 10.64898/2026.05.07.26352635 on medRxiv
May 8, 2026

Accuracy and Consistency of Frontier LLMs on Orthodontic Diagnostic Tasks: A Repeated-Trial Comparison

This article has 5 authors:
1. Kang Wan Jing
2. Jonathan Sim
3. Eugene Loh Eu-Min
4. Arthur Lim Chong Yang
5. Kelvin Weng Chiong Foong
This article has no evaluations. Latest version: May 20, 2026

Artificial Intelligence-Powered Spatial Analysis for Interpretable Bone Loss Assessment in Cone-Beam Computed Tomography

This article has 14 authors:
1. Eduardo Luiz Delamare
2. Laura Swinckels
3. Xun Li
4. Juan David Osorio
5. Katharina Alves Rabelo
6. Zimo Huang
7. Sen Le
8. Khoa Le
9. Samuel Khela
10. Shenghong Li
11. Changming Sun
12. Dadong Wang
13. Axel Spahr
14. Heiko Spallek
This article has no evaluations. Latest version: Apr 14, 2026

GF-Predictability for Dental Implants (GF-PreDImp): A Multidomain Predictive Model for Dental Implant Success—Development, Structure and Clinical Application (Project Report)

This article has 3 authors:
1. Gustavo Vicentis Oliveira Fernandes
2. Juliana Campos Hasse Fernandes
3. Sérgio A. Gehrke
This article has no evaluations. Latest version: May 21, 2026
