Prediction of Distant Metastasis of Colorectal Cancer by Machine Learning Modeling: a Retrospective Study

Abstract

Objective: This study aims to establish a machine learning (ML) model for predicting the risk of distant metastasis in colorectal cancer (CRC).

Methods: The pathological data of 104436 patients diagnosed with CRC in the United States of America between 2010 and 2021 (2858 of them from China) were extracted from the Surveillance, Epidemiology, and End Results (SEER) database (built by the National Institutes of Health of the United States of America). The patients were divided into two groups: the whole group (G_w), covering all 104436 patients, and the branch group (G_b), covering the 2858 patients from China. Each group was treated separately with 8 ML algorithms to establish models for predicting the distant metastasis risk of CRC. The resulting models were evaluated on accuracy, recall, and the area under the receiver operating characteristic (ROC) curve (AUC), and the best-performing model for each of G_w and G_b was identified. The clinical-pathological features and their relationships with the target variable were then analyzed using the best models, and the performance and generalizability of these models were externally validated on the pathological data of 109 CRC patients from the Huaihe Hospital of Henan University (Kaifeng, China). Finally, a web-based calculator was developed with the best model to predict the distant metastasis risk of CRC patients in China.

Results: A total of 104436 patients from the general population (including 2858 patients from the Chinese population) were included, of which 7864 cases (15.3%) were found to have a high risk of distant metastasis. Among the ML models developed for the Chinese population, the Gradient Boosting model showed the best predictive performance, with an AUC of 0.9571 on the internal test set. Among the models developed for the general population, the XGBoost model performed best, with an AUC of 0.9757 on the internal test set. In external validation on the pathological data of 109 CRC patients from the Huaihe Hospital of Henan University (Kaifeng, China), the model built with the G_b data outperformed the one built with the G_w data: the two models reached accuracy of 0.9083 and 0.8716, precision of 0.9060 and 0.8680, recall of 0.9083 and 0.8716, and F1 scores of 0.9067 and 0.8694, respectively.

Conclusion: This study developed a Gradient Boosting-based model for predicting the risk of distant metastasis in colorectal cancer, providing an effective clinical decision support tool for physicians.
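The evaluation metrics named in the abstract (accuracy, precision, recall, F1 score, and AUC) can be illustrated with a minimal sketch. The abstract does not say how precision, recall, and F1 were averaged across the two classes; the sketch below assumes support-weighted averaging, under which weighted recall always equals accuracy, which would be consistent with the identical accuracy and recall values reported for external validation (0.9083 and 0.8716). The function names and the toy labels and scores are hypothetical and are not taken from the study.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1.

    Assumption: support-weighted averaging over classes; the abstract
    does not state which averaging scheme the authors used.
    """
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    precision = recall = f1 = 0.0
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        w = support[c] / n  # weight each class by its share of true labels
        precision += w * p_c
        recall += w * r_c
        f1 += w * f_c
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = distant metastasis, 0 = no distant metastasis.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.6, 0.05]  # predicted risk
y_pred = [1 if s >= 0.5 else 0 for s in scores]      # 0.5 cut-off is an assumption

metrics = weighted_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Note that AUC is computed from the predicted risk scores, whereas the other four metrics depend on the chosen classification threshold, so the two kinds of figures are not directly comparable.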
