Breast cancer prediction modeling based on SHAP interpretability analysis and XGBoost algorithm

Abstract

Purpose To compare the predictive performance and risk factor screening of the extreme gradient boosting (XGBoost) model with four commonly used machine learning models for breast cancer diagnosis, and to interpret the model results through SHAP interpretability analysis.
Materials and methods Breast tumor data from the UCI public database were used. Characteristic factors were screened using a heat map of the correlation coefficient matrix, and five machine learning algorithms (XGBoost, Random Forest, K-Nearest Neighbors, Decision Tree, and Support Vector Machine) were compared by precision, recall, F1 value, and accuracy. The ROC curves of the five models were plotted, and confusion matrices were used to tabulate the prediction results, which identified XGBoost as the best-performing model. The XGBoost, decision tree, and random forest models were then used to rank the feature factors by importance, and a SHAP interpretability analysis was performed to identify the important feature factors affecting the occurrence of breast cancer.
Results The ROC curve results showed a test-set accuracy of 97.4% for the XGBoost model, 91.2% for the decision tree model, 95.6% for the random forest model, 94.7% for the K-nearest neighbors model, and 92.1% for the support vector machine model. The confusion matrix plots gave an accuracy of 97.3% for the XGBoost model, 89.5% for the decision tree model, 95.6% for the random forest model, 94.7% for the K-nearest neighbors model, and 92.1% for the support vector machine model. In the feature importance scores of all three models, the most important feature was radius-worst. The SHAP results showed that the main drivers for high-risk patients were radius-worst, concave points-worst, and concavity-worst, and that radius-worst interacted with concave points-worst.
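
The four comparison metrics named above (precision, recall, F1 value, and accuracy) can all be derived from the counts in a binary confusion matrix. The following is a minimal sketch in plain Python, not the authors' code: the function names, the `BinaryMetrics` container, and the toy labels are assumptions introduced for illustration, and the label `1` is assumed to mark the positive (malignant) class.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts.

    Undefined ratios (zero denominator) are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)


# Toy labels only (hypothetical, not data from the study).
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
toy_metrics = metrics_from_counts(*confusion_counts(toy_true, toy_pred))
print(toy_metrics)
```

Computing all four metrics from the same counts makes it easy to check that an accuracy read off a confusion matrix agrees with the accuracy reported elsewhere for the same test set.
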
Conclusions The XGBoost model was more accurate than the traditional machine learning models it was compared with. Radius-worst is an important factor affecting breast cancer occurrence, and it interacts with concave points-worst.
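
To illustrate what a SHAP attribution means when two features interact, the sketch below computes exact Shapley values by brute-force enumeration of feature subsets. It is an assumption-laden toy, not the authors' pipeline: `toy_risk_score`, its coefficients, and the baseline are invented for illustration, and the two inputs merely borrow the names radius-worst and concave points-worst. For tree ensembles such as XGBoost, the SHAP library computes these attributions far more efficiently; the enumeration here is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value.
    Cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk_score(z):
    """Hypothetical score with an interaction term (coefficients invented)."""
    radius_worst, concave_points_worst = z
    return 2.0 * radius_worst + 1.0 * concave_points_worst \
        + 3.0 * radius_worst * concave_points_worst


phi = shapley_values(toy_risk_score, x=[1.0, 1.0], baseline=[0.0, 0.0])
print(phi)
```

In this toy case the interaction term's contribution is split equally between the two features, and the attributions sum to the difference between the score at `x` and the score at the baseline, which is the additivity property that SHAP values are built on.
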
