An Ensemble-Based Machine Learning Approach to Predict 2- and 10-Year Breast Cancer

Abstract

Accurate prediction of breast cancer recurrence remains difficult because prognosis varies significantly across molecular subtypes, and genomic tests are often expensive or unavailable. As a result, many patients are assigned to broad risk categories, which may lead to overtreatment or undertreatment. This underscores the need for scalable, affordable, and interpretable prognostic tools. We developed machine learning models that integrate routine hematological indices with clinicopathologic data to predict 2- and 10-year recurrence or death. We retrospectively analyzed 4,277 women with primary breast cancer (2008–2022) from a single institution. The cohort included hormone receptor-positive (HR+; 60%), HER2-positive (21%), and triple-negative (TNBC; 18%) subtypes. We trained multiple classifiers and integrated them into a stacked ensemble using logistic regression as the final learner. Class imbalance was addressed with SMOTE applied only to training sets. The ensemble achieved strong discrimination: general cohort AUC 0.859 (2-year) and 0.814 (10-year), with specificity 88–86% and sensitivity 67–59%. Subtype-specific performance remained robust across both time horizons: HR+ AUC 0.862/0.804, HER2+ AUC 0.892/0.831, and TNBC AUC 0.834/0.829 (2-year/10-year, respectively). SHAP analysis identified advanced tumor stage, elevated inflammatory ratios (NLR, PLR, MLR), elevated red cell distribution width, and older age as key adverse predictors, with stronger effects on early recurrence. This interpretable tool uses only routine blood tests and clinicopathologic data and requires no additional infrastructure. It identifies patients suitable for treatment de-escalation while flagging high-risk patients for intensified therapy, and is especially useful where genomic testing is unavailable.
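The statement that SMOTE was applied only to training sets can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `smote` and `split_then_oversample`, the two toy features, the 25% hold-out fraction, the choice of k = 5 neighbours, and the coding of events as label 1 are all assumptions made for illustration. The point is the ordering: the data are split first, and synthetic minority rows are generated from the training portion alone, so the held-out rows remain real, unmodified patients.

```python
import random

def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic rows, each interpolated between a minority
    row and one of its k nearest minority neighbours (basic SMOTE)."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by squared Euclidean distance.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def split_then_oversample(X, y, test_fraction=0.25, seed=0):
    """Split first, then balance the training portion only.
    Assumes label 1 (event) is the minority class."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]   # held out, never oversampled
    y_test = [y[i] for i in test_idx]
    minority = [x for x, label in zip(X_train, y_train) if label == 1]
    n_majority = y_train.count(0)
    new_rows = smote(minority, max(0, n_majority - len(minority)), rng=rng)
    return X_train + new_rows, y_train + [1] * len(new_rows), X_test, y_test

# Toy data (hypothetical): two made-up numeric features, 30% events.
data_rng = random.Random(42)
X = [[data_rng.uniform(1, 6), data_rng.uniform(11, 16)] for _ in range(100)]
y = [1] * 30 + [0] * 70
X_train, y_train, X_test, y_test = split_then_oversample(X, y)
```

Oversampling before the split would let synthetic rows interpolated from held-out patients leak into training, which tends to inflate reported discrimination; splitting first avoids that. In a cross-validated setting the same ordering would be repeated inside each fold.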