Understanding Cancer Risk Among Bangladeshi Women: An Explainable Machine Learning Approach to Socio-Reproductive Factors Using Tertiary Hospital Data
Abstract
Background: Breast cancer poses a significant health challenge in Bangladesh, where limited screening and unique reproductive patterns contribute to delayed diagnoses and subtype-specific disparities. While reproductive risk factors such as age at menarche, parity, and contraceptive use are well-studied in high-income countries, their associations with hormone receptor-positive (HR+) and triple-negative breast cancer (TNBC) remain underexplored in low-resource settings.

Methods: A case-control study was conducted at the National Institute of Cancer Research and Hospital (NICRH) including 486 histopathologically confirmed breast cancer cases (246 HR+, 240 TNBC) and 443 cancer-free controls. Socio-demographic and reproductive data were collected through structured interviews. Machine learning models (Logistic Regression, Lasso, Support Vector Machines, Random Forest, and XGBoost) were trained using stratified 5-fold cross-validation. Model performance was evaluated using sensitivity, F1-score, and AUC. Feature importance was interpreted using Shapley values.

Results: XGBoost achieved the highest overall performance (F1-score = 0.750). Rural residence, low education (≤5 years), and undernutrition were significant predictors across subtypes. Cesarean delivery and multiple abortions were more predictive of TNBC, while urban residence, employment, and higher education were more predictive of HR+. Age at menarche and age at first childbirth showed decreasing predictive importance with increasing age for HR+, while larger gaps between marriage and childbirth were more predictive of TNBC.
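
The evaluation protocol named in the Methods (stratified 5-fold cross-validation, scored by sensitivity and F1-score) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the function names `stratified_folds` and `sensitivity_and_f1` are hypothetical, the round-robin fold assignment is one simple way to stratify, and only the class counts (486 cases, 443 controls) come from the abstract. Model fitting, AUC, and Shapley value computation are deliberately left out.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal share of the class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def sensitivity_and_f1(y_true, y_pred, positive=1):
    """Return (sensitivity, F1) for the positive class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, f1


# Class counts from the abstract: 1 = breast cancer case, 0 = cancer-free control.
labels = [1] * 486 + [0] * 443
folds = stratified_folds(labels, k=5)

for i, fold in enumerate(folds):
    n_cases = sum(labels[j] for j in fold)
    print(f"fold {i}: {len(fold)} samples, {n_cases} cases, {len(fold) - n_cases} controls")
```

In each round, one fold would serve as the held-out set and the remaining four as training data; `sensitivity_and_f1` would then be applied to the held-out predictions and the scores averaged across the five rounds.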