DentaCoPilot: An LLM-Augmented Next-Procedure Recommender for General Dentistry, Designed for Dentist Augmentation
Abstract
Commercial dental artificial intelligence in 2026 is overwhelmingly diagnostic, focusing on caries, calculus, periapical, and bone-level detection on radiographs. The clinically harder question that follows every diagnosis (given a patient's chart and most recent procedure, what should the dentist do next?) remains unsolved at general-dentistry scale. The closest published system, MultiTP (Chen et al., 2024), is a CNN-RNN restricted to partial-edentulism cases; it provides no calibrated uncertainty, no structured rationale, and no evaluation that treats the model as decision support rather than as an autonomous classifier. We introduce DentaCoPilot, a recommender that, given a structured chart, returns (i) a calibrated top-K probability distribution over Current Dental Terminology (CDT) codes for the next procedure, (ii) a verbalized confidence label, (iii) an explicit abstain flag when context is insufficient, and (iv) a chart-grounded rationale. We compare four classical baselines (frequency bigram, TF-IDF + logistic regression, XGBoost, and a MultiTP-style CNN-RNN) and six large language model (LLM) variants (Claude Haiku, Sonnet + chain-of-thought, Sonnet + retrieval, Opus + chain-of-thought, Sonnet + classical prior, and Opus + classical prior) on a synthetic chart corpus of 500 patients (1,284 test examples). We further establish real-world external validity using the public MEPS 2023 Dental Visits corpus, comprising 11,016 next-visit transitions across 5,088 patients at treatment-category granularity. All LLM inference is routed through the local Anthropic Claude Code CLI, and every call is logged for auditability. On an apples-to-apples evaluation, classical baselines achieve 0.567 top-1 and 0.967 top-5 accuracy, whereas pure LLM variants achieve 0.267 to 0.467 top-1 accuracy.
Prompt-conditioning a Sonnet LLM on the classical baseline's top-10 candidates (M5) substantially closes the gap: top-5 accuracy increases from 0.733 (pure Sonnet + chain-of-thought) to 0.933, approaching classical baseline performance on the synthetic dataset while preserving rationale generation and abstention capabilities. Because the synthetic generator shares Markov structure with the bigram baseline, these synthetic accuracy rankings should be interpreted as indicators of pipeline behavior rather than as externally valid clinical performance. Upgrading the LLM backbone from Sonnet to Opus does not improve accuracy, with or without the classical prior. We also report temperature-scaling calibration and coverage-risk analysis for the baselines. On the real-world MEPS corpus, history-aware baselines achieve 0.479 top-1 and 0.781 top-3 accuracy (macro-F1 = 0.42), compared with 0.297 top-1 accuracy for a most-frequent baseline, confirming that procedure history is predictive of subsequent treatment categories. At the category granularity available in MEPS, a zero-shot LLM given only coarse procedure history underperforms the Markov baseline (0.213 vs. 0.533 top-1 accuracy; n = 150), indicating that the value of LLMs depends on access to richer chart information rather than category history alone. Prompt-conditioning a small LLM on a classical baseline's top-K predictions emerges as the most cost-effective LLM design evaluated for next-procedure recommendation, while preserving the augmentation features that distinguish the system from an autonomous classifier. A pre-registered clinician-in-the-loop evaluation at the KLE Vishwanath Katti Institute of Dental Sciences (Belgaum, India), together with CDT-level multi-institutional validation under institutional data-use agreements, constitutes the next stage of this work.
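To make the frequency bigram baseline and the top-K accuracy metric concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each chart reduces to an ordered list of procedure codes and that the prediction depends only on the most recent procedure. The function names (`fit_bigram`, `predict_top_k`, `top_k_accuracy`) and the toy sequences are hypothetical, and returning an empty list for an unseen context is an illustrative stand-in for an abstain signal rather than the paper's abstention rule.

```python
from collections import Counter, defaultdict


def fit_bigram(sequences):
    """Count transitions so that counts[prev][nxt] is how often nxt followed prev."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_top_k(counts, prev, k=5):
    """Return up to k (code, probability) pairs for the next procedure, most likely first."""
    row = counts.get(prev)
    if not row:
        # Unseen context: return nothing (an illustrative abstain, see lead-in).
        return []
    total = sum(row.values())
    return [(code, n / total) for code, n in row.most_common(k)]


def top_k_accuracy(counts, examples, k):
    """Fraction of (prev, actual) pairs whose actual code is among the top-k predictions."""
    hits = 0
    for prev, actual in examples:
        predicted = [code for code, _ in predict_top_k(counts, prev, k)]
        hits += actual in predicted
    return hits / len(examples)


# Toy procedure sequences, made up for illustration only (not from the paper's corpus).
train = [
    ["D0120", "D1110", "D0274"],
    ["D0120", "D1110", "D2391"],
    ["D0120", "D0274", "D2391"],
    ["D3330", "D2740"],
]
# Held-out (previous procedure, actual next procedure) pairs, also made up.
test = [
    ("D0120", "D1110"),
    ("D0120", "D0274"),
    ("D3330", "D2740"),
    ("D2740", "D0120"),  # "D2740" never appears as a previous code in train
]

counts = fit_bigram(train)
print(top_k_accuracy(counts, test, k=1))  # 0.5
print(top_k_accuracy(counts, test, k=2))  # 0.75
```

On this toy data, widening K from 1 to 2 recovers the second most frequent transition after `D0120`, which is the same mechanism behind the gap between top-1 and top-5 accuracy reported above.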