Bayesian Network Structure Learning from Incomplete Breast Cancer Data Using Structural Expectation–Maximization
Abstract
Breast cancer is one of the most common malignancies worldwide, and recent reports from Iran indicate rising incidence and mortality. Data-driven analytic methods are increasingly used to support clinical decision-making; however, medical datasets typically contain substantial missingness. In this study, we apply Structural Expectation-Maximization (SEM), an efficient approach for model learning with incomplete data, to discover Bayesian network structures from simulated datasets and a real breast cancer dataset. We also compare SEM against Multiple Imputation by Chained Equations (MICE), one of the most widely used imputation strategies. In the simulation study, both SEM and MICE achieved high accuracy, but SEM provided greater sensitivity and F1-scores, particularly under Missing-at-Random (MAR) and Missing-Not-at-Random (MNAR) mechanisms and at higher levels of missingness. For the real dataset (clinical, pathological, and demographic data from approximately 2,000 Iranian women, with about 10% missingness), SEM alone was used. The resulting Bayesian network exhibited clinically interpretable dependencies: the number of involved lymph nodes depended on tumor size, disease stage, and axillary surgery; tumor size was linked to surgical modality and radiotherapy; disease stage influenced chemotherapy, tumor grade, lymphovascular invasion, and pathological type; and molecular subtype was associated with hormonal therapy. These relationships are consistent with established oncological knowledge, demonstrating the utility of SEM in structural discovery under incomplete medical data.
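The abstract does not say how accuracy, sensitivity, and F1 were computed. As an illustration only, the sketch below assumes that a learned edge counts as correct when it matches a true edge in both endpoints and direction, and that every ordered pair of distinct nodes is one classification decision. The function name `edge_metrics` and the toy node names are hypothetical and are not taken from the study.

```python
def edge_metrics(true_edges, learned_edges, nodes):
    """Compare a learned directed graph with a reference graph, edge by edge.

    Assumption: an edge is a (parent, child) tuple and must match exactly,
    including direction, to count as a true positive.
    """
    true_set = set(true_edges)
    learned_set = set(learned_edges)
    # Every ordered pair of distinct nodes is a potential directed edge.
    n_possible = len(nodes) * (len(nodes) - 1)

    tp = len(true_set & learned_set)   # edges correctly recovered
    fp = len(learned_set - true_set)   # spurious edges
    fn = len(true_set - learned_set)   # missed edges
    tn = n_possible - tp - fp - fn     # absent edges correctly left out

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    accuracy = (tp + tn) / n_possible if n_possible else 0.0

    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "sensitivity": sensitivity, "f1": f1}


# Toy example with hypothetical variables (not the study's data).
nodes = ["tumor_size", "stage", "lymph_nodes", "chemotherapy"]
true_edges = [("tumor_size", "lymph_nodes"),
              ("stage", "lymph_nodes"),
              ("stage", "chemotherapy")]
learned_edges = [("tumor_size", "lymph_nodes"),
                 ("stage", "chemotherapy"),
                 ("tumor_size", "chemotherapy")]

metrics = edge_metrics(true_edges, learned_edges, nodes)
print(metrics)
```

In this toy case the graph is sparse, so most of the 12 possible edges are true negatives. Accuracy comes out at 10/12 even though only two of the three true edges were recovered (sensitivity and F1 of 2/3). This is one plausible reason why two methods can both show high accuracy while differing noticeably in sensitivity and F1.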