Machine Learning–Based Analysis of Factors Associated With Colonoscopy Screening Adherence and Development of a Predictive Model Among High-Risk Individuals for Colorectal Cancer

Abstract

Background: Completing colonoscopy among individuals at high risk for colorectal cancer (CRC) is essential to the effectiveness of early detection and treatment programs; however, colonoscopy adherence remains low. Practical prediction tools are needed to support risk stratification and targeted management.

Objective: To identify factors associated with colonoscopy adherence among individuals at high risk for CRC and to develop and validate a machine learning–based model that predicts the probability of adherence.

Methods: High-risk participants were retrospectively extracted from the screening database of the Bengbu Colorectal Cancer Early Diagnosis and Treatment Program. The 1,267 high-risk individuals screened from June to December 2021 formed the training set, and the 1,235 screened from January to December 2022 formed the validation set. The outcome was colonoscopy completion. Univariate comparisons were performed using the chi-square test. Variables selected in univariate analysis were entered into least absolute shrinkage and selection operator (LASSO) regression for feature selection, and 15 machine learning models were then developed. Model performance was evaluated using bootstrap-based receiver operating characteristic (ROC) analysis, area under the ROC curve (AUC), precision–recall (PR) curves, calibration curves, and decision curve analysis. Shapley Additive Explanations (SHAP) were applied to interpret the optimal model.

Results: A total of 2,502 high-risk individuals were included, and the colonoscopy adherence rate was 27.09%. LASSO and multivariable logistic regression identified age 50–59 years, age 60–69 years, meat intake habit, chronic diarrhea, mucus or bloody stool, and a first-degree family history of CRC as significant correlates of higher adherence, with first-degree family history of CRC showing the strongest association (OR = 16.180). Among the 15 machine learning models, XGBoost achieved the highest AUC in the validation set (0.846, 95% CI 0.840–0.850), with favorable sensitivity and F1-score. SHAP analysis indicated that first-degree family history of CRC contributed most to model output, followed by gastrointestinal symptom cues and age.

Conclusions: Colonoscopy adherence among individuals at high risk for CRC in Bengbu was low (27.09%). The machine learning model built on five key features demonstrated good discrimination and interpretability, and may serve as a quantitative tool to support targeted interventions and optimized resource allocation in CRC screening pathways.
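The abstract reports a bootstrap-based AUC with a 95% confidence interval but does not describe the resampling procedure. As an illustration only, the sketch below shows one common approach, a percentile bootstrap over validation-set cases, using NumPy alone. The function names (`auc_score`, `bootstrap_auc_ci`), the number of resamples, the seed, and the synthetic data are all assumptions for demonstration; none of them come from the study, and the authors may have used a different bootstrap variant.

```python
import numpy as np


def auc_score(y_true, y_score):
    """AUC computed from the Mann-Whitney U statistic; tied scores share an average rank."""
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Group equal scores, then give every member of a group the group's mean rank.
    _, inverse, counts = np.unique(y_score, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)                # 1-based rank of the last item in each group
    mean_rank = last_rank - (counts - 1) / 2.0   # average rank within each group
    ranks = mean_rank[inverse]
    u_stat = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Point AUC plus a percentile bootstrap interval (hypothetical helper, not the authors' code)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score, dtype=float)
    n = y_true.size
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)         # resample cases with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                             # skip resamples containing a single class
        aucs.append(auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return auc_score(y_true, y_score), float(lower), float(upper)


# Synthetic stand-in data: labels and scores are simulated, not taken from the study.
demo_rng = np.random.default_rng(42)
demo_labels = (demo_rng.random(500) < 0.27).astype(int)          # roughly 27% positives
demo_scores = 0.3 * demo_labels + 0.7 * demo_rng.random(500)     # noisy but informative scores
demo_auc, demo_lower, demo_upper = bootstrap_auc_ci(demo_labels, demo_scores)
print(f"AUC = {demo_auc:.3f} (95% CI {demo_lower:.3f} to {demo_upper:.3f})")
```

With real data, `demo_labels` would be the observed colonoscopy completion indicator in the validation set and `demo_scores` the predicted adherence probabilities from the fitted model.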
