Predicting High-Risk Colorectal Polyps Using Pre-Colonoscopy Features: Machine Learning Model Development and Validation

Abstract

Purpose: Risk stratification for advanced colorectal polyps typically relies on colonoscopy and/or pathology findings. However, there is growing interest in whether non-invasive features available prior to colonoscopy can help identify patients at higher risk. Such approaches may enhance clinical decision-making by prioritizing surveillance for individuals most likely to harbor high-risk polyps when colonoscopy resources are limited, while potentially reducing unnecessary procedures in lower-risk patients. Importantly, the use of non-invasive, pre-procedural information may also help promote more equitable access to risk stratification, particularly in settings where such resources are unevenly distributed. We aimed to develop and externally validate machine learning models to predict high-risk colorectal polyps using only non-invasive, pre-colonoscopy demographic, clinical, and behavioral features in a diverse, predominantly African American, urban cohort.

Methods: We conducted a retrospective cohort study using demographic, lifestyle, and comorbidity data from patients who underwent colonoscopy at Howard University Hospital. We developed and validated several machine learning models for predicting high-risk colorectal polyps: neural networks, random forest, support vector machines (SVM), Naïve Bayes, logistic regression, decision trees, k-nearest neighbors (KNN), and XGBoost. High-risk polyps (HRP) were defined as villous or tubulovillous adenomas, high-grade dysplasia, polyps $\geq 10$ mm in size, and/or the presence of $\geq 3$ polyps per procedure; all other cases were classified as low-risk polyps (LRP). The dataset included 4,681 patients from 2015-2022 used for internal validation and 1,562 patients from 2023-2024 used for external validation. Model performance was evaluated using the area under the receiver operating characteristic curve (ROC-AUC), the area under the precision-recall curve (PR-AUC), accuracy, precision, recall, and F1 score. Model interpretability and feature contribution were assessed using SHapley Additive exPlanations (SHAP).

Results: Overall predictive performance using non-invasive pre-colonoscopy features was moderate. The neural network demonstrated the strongest overall discrimination, achieving the highest internal validation performance (ROC-AUC 0.78, PR-AUC 0.75, accuracy 0.72), but its performance declined in the external cohort (ROC-AUC 0.67, accuracy 0.66), suggesting potential overfitting or temporal feature drift. In contrast, simpler models including Naïve Bayes, SVM, and XGBoost exhibited lower internal performance (ROC-AUC 0.54-0.59) but more stable generalization to the external cohort (ROC-AUC 0.52-0.63; accuracy approximately 0.53-0.60). SHAP analysis identified age, smoking status, sex, occupation, race, colonoscopy indication, and family history of colorectal cancer as the most influential predictors, highlighting contributions from both traditional clinical and sociodemographic factors.

Conclusions: Prediction of HRP using routine pre-colonoscopy data is feasible but shows limited generalizability across cohorts. These findings highlight both the clinical potential and the limitations of pre-procedural risk modeling, especially in diverse, underserved populations. Integration of additional data modalities may be required to achieve clinically robust and equitable prediction tools.
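The HRP/LRP outcome definition given in the Methods can be written as a simple labeling rule. The following is a minimal sketch only, not the authors' code: the `Polyp` record, its field names, and the function `is_high_risk` are hypothetical, and the histology strings are assumed to be recorded as plain text. Note that this label is derived from colonoscopy and pathology findings and serves as the prediction target, not as a model input.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Polyp:
    # Hypothetical per-polyp record; field names are illustrative assumptions.
    histology: str               # e.g. "tubular", "villous", "tubulovillous"
    high_grade_dysplasia: bool
    size_mm: float


def is_high_risk(polyps: List[Polyp]) -> bool:
    """Return True if a procedure meets the abstract's HRP definition."""
    # Three or more polyps in one procedure is high risk on its own.
    if len(polyps) >= 3:
        return True
    # Otherwise, any single polyp with villous/tubulovillous histology,
    # high-grade dysplasia, or size of at least 10 mm is high risk.
    return any(
        p.histology.strip().lower() in {"villous", "tubulovillous"}
        or p.high_grade_dysplasia
        or p.size_mm >= 10
        for p in polyps
    )
```

For example, a procedure with a single 4 mm tubular adenoma without high-grade dysplasia would be labeled LRP under this rule, whereas the same polyp at 10 mm would be labeled HRP.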
