A Comparative Analysis in a Clinical Cohort: Multiple Imputation by Chained Equations and a Novel Super Learner-Based Imputation Approach
Abstract
Background
Missing data are a persistent challenge in clinical research, especially in real-world data (RWD), where complete case analysis can bias results and reduce statistical power. Ensemble learning approaches such as Super Learner (SL) perform well on prediction problems, but their use for missing value imputation (MVI) in oncology datasets remains unexplored. We sought to develop and evaluate a novel SL-based imputation function that can impute multiple variables and quantify uncertainty.
Methods
We analyzed two independent cohorts of acute myeloid leukemia patients (n=1641). The SL-based MVI function includes data processing, predictor selection, binary and continuous variable pipelines, and performance measurement. Performance was compared to multiple imputation by chained equations (MICE) using balanced accuracy, F1-score, root mean square error (RMSE), and visualizations.
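As an illustration of the three numerical metrics named above, the following is a minimal sketch using their standard definitions. It is not the study's code: the function names and the toy values are hypothetical, binary variables are assumed to be coded 0/1, and both classes are assumed to be present among the held-out true values.

```python
import math


def _confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (assumes both classes occur)."""
    tp, tn, fp, fn = _confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, _, fp, fn = _confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)


def rmse(y_true, y_pred):
    """Root mean square error between true and imputed continuous values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values only: entries that were masked on purpose, then imputed.
true_binary = [1, 1, 1, 0, 0, 0, 0, 0]
imputed_binary = [1, 1, 0, 0, 0, 0, 0, 1]
true_continuous = [10.0, 12.0, 14.0]
imputed_continuous = [11.0, 12.0, 16.0]

print(balanced_accuracy(true_binary, imputed_binary))
print(f1_score(true_binary, imputed_binary))
print(rmse(true_continuous, imputed_continuous))
```

Balanced accuracy averages the per-class recall, so a method that mostly imputes the majority class of an imbalanced binary variable is not rewarded for doing so.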
Results
In a numerical experiment with 9 clinically important features, the proposed MVI function achieved higher balanced accuracy than MICE for 7 of 9 variables (mean balanced accuracy 89.04% vs 80.75%), with comparable performance on the remaining variables. The continuous-variable SL ensemble showed an RMSE (582) comparable to that of MICE (597).
Conclusions
This study demonstrates that the SL-based imputation function improves accuracy over MICE in high-dimensional RWD while providing novel, observation-level uncertainty quantification.