Predicting Early-Onset Colorectal Cancer in Individuals Below Screening Age Using Machine Learning and Real-World Data

Abstract

Background

Colorectal cancer (CRC) is now the leading cause of cancer-related deaths among young Americans. This study aims to predict early-onset CRC (EOCRC) in individuals below the screening age of 45 using machine learning (ML) and structured electronic health record (EHR) data.

Methods

We identified a cohort of patients under 45 from the OneFlorida+ Clinical Research Consortium. Because colon cancer (CC) and rectal cancer (RC) have distinct pathology, we built separate prediction models for each cancer type using various ML algorithms. We evaluated multiple prediction time windows (0, 1, 3, and 5 years) and used propensity score matching (PSM) to account for confounding variables. Model performance was assessed using established metrics. Additionally, we used SHapley Additive exPlanations (SHAP) to identify risk factors for EOCRC.
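
As an illustration of the prediction time windows, the sketch below assumes that a window of k years means only EHR records dated at least k years before a patient's index date are available to the model. That reading, the function name `records_before_window`, the approximation of a year as 365.25 days, and the record labels are all assumptions made for this example, not details taken from the study.

```python
from datetime import date, timedelta


def records_before_window(records, index_date, window_years):
    """Keep (record_date, code) pairs dated on or before the window cutoff.

    Assumption: the cutoff is index_date minus window_years, with a year
    approximated as 365.25 days and rounded to whole days.
    """
    cutoff = index_date - timedelta(days=round(365.25 * window_years))
    return [rec for rec in records if rec[0] <= cutoff]


# Hypothetical patient history; the labels are placeholders, not real codes.
records = [
    (date(2015, 3, 1), "code_a"),
    (date(2018, 6, 15), "code_b"),
    (date(2019, 12, 1), "code_c"),
]
index_date = date(2020, 1, 1)

for years in (0, 1, 3, 5):
    kept = records_before_window(records, index_date, years)
    print(years, [code for _, code in kept])
```

Longer windows leave fewer records to learn from, which is one plausible reason for predictive performance to fall as the window grows.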

Results

The models achieved area under the curve (AUC) scores of 0.811, 0.748, 0.689, and 0.686 for CC prediction, and 0.829, 0.771, 0.727, and 0.721 for RC prediction at 0, 1, 3, and 5 years, respectively. Important predictors shared by the CC and RC groups included immune and digestive system disorders, secondary cancers, and underweight. Blood diseases emerged as prominent indicators of CC.
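
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as half. The sketch below computes it that way on made-up labels and scores. The function name `pairwise_auc` and the toy data are illustrative only and do not reproduce the study's evaluation code.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for a case, 0 for a control.
    scores: predicted risk, higher meaning more likely to be a case.
    Ties between a case and a control count as half a correct pair.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    correct = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                correct += 1.0
            elif c == n:
                correct += 0.5
    return correct / (len(cases) * len(controls))


# Toy example: two cases, two controls; three of four pairs ranked correctly.
print(pairwise_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The pairwise loop is quadratic in cohort size, so it suits small illustrations; a rank-based implementation would be the usual choice for large cohorts.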

Conclusion

This study highlights the potential of ML techniques applied to EHR data to predict EOCRC, which could support earlier diagnosis in patients below the recommended screening age.
