A selective machine learning algorithm for severe periodontitis labeling from questionnaire data

Abstract

Epidemiological cohorts often collect self-reported oral health (SROH) questionnaires but lack clinical periodontal measurements. We developed a selective, explainable machine learning (ML) pipeline that assigns a label of severe periodontitis (SP) or no periodontitis (NP) only when it is sufficiently confident, and otherwise leaves the case unlabeled. Three datasets (n = 498) with SROH questionnaires, demographics, and Community Periodontal Index of Treatment Needs (CPITN) scores were used to derive NP, moderate periodontitis (MP), and SP categories. MP cases were excluded from model development. After cleaning and feature engineering, duplicate records carrying discordant labels were removed. A CatBoost model (Separator-A) was trained with 10-fold cross-validation, and its NP/SP predictions were retained only when the predicted probability was ≥ 0.85. From these outputs and domain rules, a rule-consistent subset was created to train a second model (Separator-Z). Performance was evaluated on internal test and hold-out inference sets, after which the pipeline was applied to the MP cases. Separator-A achieved an AUROC of 0.90 on internal validation and 0.95 on hold-out inference. Separator-Z showed perfect discrimination across all sets (AUROC, sensitivity, specificity, and F1 all equal to 1.00). No MP case was misclassified as NP or SP, and NP/SP labeling remained highly precise, albeit with reduced coverage. Thus, a two-stage, explainable ML pipeline can selectively identify SP and NP from SROH questionnaire data, supporting case–control selection in cohorts without clinical periodontal examinations.
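
The selective step described above (keep an NP or SP label only when the model is at least 0.85 confident) can be illustrated with a minimal sketch. This is not the authors' code. It assumes the classifier outputs a single probability of SP per respondent and that an NP label is kept when the complementary probability reaches the same threshold. The names `selective_labels` and `coverage` and the example probabilities are hypothetical.

```python
from typing import List, Optional


def selective_labels(p_sp: List[float], threshold: float = 0.85) -> List[Optional[str]]:
    """Assign 'SP' or 'NP' only when the model is confident; otherwise abstain.

    p_sp holds the predicted probability of severe periodontitis for each
    respondent (assumption: binary NP/SP model, so P(NP) = 1 - P(SP)).
    """
    labels: List[Optional[str]] = []
    for p in p_sp:
        if p >= threshold:
            labels.append("SP")      # confident severe periodontitis
        elif 1.0 - p >= threshold:
            labels.append("NP")      # confident no periodontitis
        else:
            labels.append(None)      # abstain: below the confidence threshold
    return labels


def coverage(labels: List[Optional[str]]) -> float:
    """Fraction of cases that received a label (were not abstained on)."""
    if not labels:
        return 0.0
    return sum(label is not None for label in labels) / len(labels)


# Hypothetical probabilities, for illustration only.
probs = [0.95, 0.10, 0.60, 0.85, 0.30]
labels = selective_labels(probs)
print(labels)
print(coverage(labels))
```

Raising the threshold trades coverage for precision: fewer respondents receive a label, but those that do are labeled with higher confidence, which matches the "highly precise, albeit with reduced coverage" behavior reported in the abstract.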
