Accuracy and Consistency of Frontier LLMs on Orthodontic Diagnostic Tasks: A Repeated-Trial Comparison
Abstract
Importance
Large language models are increasingly explored as clinical decision-support tools in orthodontics, yet existing evaluations have been confined to knowledge-based question answering, where reported accuracy ranges from 18% to 100%. No study has evaluated performance on the computational and classificatory tasks that define daily diagnostic work. Furthermore, 84.3% of published healthcare large language model studies do not report how many times each query was repeated, leaving output stochasticity unexamined.
Objective
To compare the diagnostic accuracy and output consistency of three frontier reasoning-enhanced large language models, ChatGPT 5.4 (Thinking), Gemini 3 (Thinking), and Claude Opus 4.6 (Extended Thinking), on four orthodontic diagnostic tasks: Bolton analysis, Index of Orthodontic Treatment Need-Dental Health Component (IOTN-DHC) classification, space analysis, and lateral cephalometric interpretation.
Methods
In this comparative cross-sectional study with a repeated-measures design, each model was accessed through its consumer-facing web interface under default provider settings rather than through an application programming interface. Each model processed 200 purpose-built items (50 per task) in each of four independent trials, yielding 2,400 observations (3 models × 200 items × 4 trials). Responses were scored against a pre-established reference standard by two independent raters using strict binary exact-match criteria. Accuracy was reported with exact binomial 95% confidence intervals. Inter-model comparisons used Cochran’s Q test with post-hoc McNemar’s tests and Bonferroni correction. A supplementary context-rich prompting evaluation was conducted on 40 items (480 observations).
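The inter-model comparison described above can be illustrated with a minimal sketch. It assumes one binary score (1 = correct, 0 = incorrect) per item per model. The function names, the toy data, and the use of the exact (binomial) form of McNemar's test are assumptions made for illustration, since the abstract does not say which McNemar variant or which software was used. With three models, Cochran's Q has 2 degrees of freedom, so the chi-square p-value reduces to exp(-Q/2).

```python
from itertools import combinations
from math import comb, exp


def cochran_q(rows):
    """Cochran's Q for per-item tuples of 0/1 scores, one entry per model.

    Returns (Q, p). The p-value uses the chi-square survival function with
    2 degrees of freedom, exp(-Q/2), so this sketch supports 3 models only.
    """
    k = len(rows[0])
    if k != 3:
        raise ValueError("closed-form p-value here assumes exactly 3 models")
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n = sum(row_totals)
    denom = k * n - sum(t * t for t in row_totals)
    if denom == 0:  # every item has all models right or all models wrong
        return 0.0, 1.0
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, exp(-q / 2)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def pairwise_mcnemar(rows, names):
    """Bonferroni-adjusted exact McNemar p-values for every pair of models."""
    pairs = list(combinations(range(len(names)), 2))
    adjusted = {}
    for i, j in pairs:
        b = sum(1 for r in rows if r[i] == 1 and r[j] == 0)
        c = sum(1 for r in rows if r[i] == 0 and r[j] == 1)
        adjusted[(names[i], names[j])] = min(1.0, mcnemar_exact(b, c) * len(pairs))
    return adjusted


# Hypothetical scores for 10 items and 3 models (NOT data from the study).
toy_scores = [
    (1, 1, 1), (1, 1, 0), (1, 0, 0), (1, 1, 1), (1, 0, 1),
    (1, 1, 0), (1, 1, 1), (1, 0, 0), (1, 1, 1), (0, 1, 1),
]
q_stat, p_value = cochran_q(toy_scores)
print(f"Cochran's Q = {q_stat:.3f}, p = {p_value:.3f}")
print(pairwise_mcnemar(toy_scores, ["A", "B", "C"]))
```

Items on which all three models agree contribute nothing to Q, which is why the test remains usable when overall accuracy is near ceiling and only a handful of items are discordant.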
Results
Claude Opus 4.6 (Extended Thinking) achieved the highest accuracy (99.0%; 95% CI: 96.4–99.9%), followed by Gemini 3 (Thinking) (95.5%; 91.6–98.1%) and ChatGPT 5.4 (Thinking) (94.0%; 89.8–96.9%) (Cochran’s Q = 6.87, p = 0.032). Each model exhibited distinct, non-overlapping error profiles concentrated at the normal–abnormal classification boundary. An accuracy–consistency paradox emerged: the most accurate model was the least consistent (93.0%), while the least accurate was the second-most consistent (98.0%). Context-rich prompting eliminated all errors across all three models.
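As a rough cross-check of how intervals like these arise, the following sketch computes exact binomial (Clopper-Pearson) 95% confidence intervals by bisection on the binomial CDF. The counts of 198, 191, and 188 correct out of 200 are inferred from the reported percentages and are an assumption; the abstract does not state the denominators or the software used. The last lines evaluate the chi-square tail probability for the reported Q with 2 degrees of freedom, which has the closed form exp(-Q/2).

```python
from math import comb, exp


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_decreasing(f, target, iterations=100):
    """Find p in [0, 1] with f(p) == target, where f decreases as p grows."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n."""
    lower = 0.0 if k == 0 else _bisect_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else _bisect_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Assumed counts (inferred from the reported percentages, n = 200 per model).
assumed_correct = {
    "Claude Opus 4.6 (Extended Thinking)": 198,
    "Gemini 3 (Thinking)": 191,
    "ChatGPT 5.4 (Thinking)": 188,
}
for model, k in assumed_correct.items():
    low, high = clopper_pearson(k, 200)
    print(f"{model}: {k / 200:.1%} ({low:.1%} to {high:.1%})")

# Chi-square survival function with 2 degrees of freedom at the reported Q.
print(f"p for Q = 6.87, df = 2: {exp(-6.87 / 2):.3f}")
```

Under the assumed counts, the intervals are asymmetric around the point estimate, as expected for proportions close to 100%.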
Interpretation
Frontier reasoning large language models achieved high overall accuracy on orthodontic diagnostic tasks but retained concealed, task-specific vulnerabilities detectable only through repeated-trial evaluation. The accuracy–consistency paradox, in which the most accurate model was the least consistent, demonstrates that single-trial evaluations cannot characterise clinical risk. The reasoning modes were associated with high arithmetic accuracy but did not compensate for imprecise parametric knowledge on classification tasks; however, because no non-thinking baseline was included, this association cannot be attributed to the thinking mode itself. Context-rich prompting eliminated all errors on synthetic data but should be regarded as a necessary yet insufficient prerequisite for clinical deployment, pending prospective validation on real patient data.
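To make the distinction between accuracy and consistency concrete, the sketch below uses one plausible definition of consistency: the share of items whose scored outcome is identical across all four trials. This definition, the function names, and the toy data are assumptions for illustration only; the abstract does not define the consistency metric.

```python
def trial_accuracy(item_scores, trial):
    """Proportion of items scored correct (1) in a single trial."""
    return sum(scores[trial] for scores in item_scores) / len(item_scores)


def consistency(item_scores):
    """Proportion of items with the same score in every trial (assumed definition)."""
    stable = sum(1 for scores in item_scores if len(set(scores)) == 1)
    return stable / len(item_scores)


# Hypothetical data: 5 items x 4 trials per model (NOT data from the study).
model_x = [(1, 1, 1, 1), (1, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 1)]
model_y = [(1, 1, 1, 1), (1, 1, 1, 1), (1, 1, 1, 1), (1, 1, 1, 1), (0, 0, 0, 0)]

for name, data in [("model_x", model_x), ("model_y", model_y)]:
    print(name,
          "first-trial accuracy:", trial_accuracy(data, 0),
          "consistency:", consistency(data))
```

In this toy case, a single trial would rank model_x above model_y, while only the repeated trials reveal that model_x changes its answer on two of five items.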