A Comparative Analysis in a Clinical Cohort: Multiple Imputation by Chained Equations and a Novel Super Learner-Based Imputation Approach

Tony Zbysinski
Lezhou Wu
Justin Dale
James Coates
Karan Sapiah
Jamie Reuben
Frank Markson
Ujjwal Kulkarni
Nazmul Islam


Abstract

Background

Missing data is a common challenge in clinical research, especially in real-world data (RWD), where complete case analysis can bias results and reduce power. Ensemble learning approaches such as Super Learner (SL) show strong numerical performance for prediction problems, but their use for missing value imputation (MVI) in oncology datasets remains unexplored. We sought to develop and evaluate a novel SL-based imputation function that can impute multiple variables and quantify uncertainty.

Methods

We analyzed two independent cohorts of acute myeloid leukemia patients (n=1641). The SL-based MVI function includes data processing, predictor selection, binary and continuous variable pipelines, and performance measurement. Performance was compared to multiple imputation by chained equations (MICE) using balanced accuracy, F₁-score, root mean square error (RMSE), and visualizations.
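The three numerical metrics named above can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the authors' implementation; the function names and the toy values are hypothetical, and the binary labels are assumed to be coded 0/1 with both classes present among the masked values.

```python
import math


def _confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (assumes both classes occur)."""
    tp, tn, fp, fn = _confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, _, fp, fn = _confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)


def rmse(y_true, y_pred):
    """Root mean square error between true and imputed continuous values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: values that were masked, then imputed.
binary_true = [1, 1, 1, 0, 0, 0, 0, 0]
binary_imputed = [1, 1, 0, 0, 0, 0, 0, 1]
continuous_true = [10.0, 20.0, 30.0]
continuous_imputed = [12.0, 18.0, 33.0]

bal_acc = balanced_accuracy(binary_true, binary_imputed)
f1 = f1_score(binary_true, binary_imputed)
error = rmse(continuous_true, continuous_imputed)
```

Balanced accuracy weights both classes equally, which matters when a binary clinical variable is imbalanced; plain accuracy would reward always imputing the majority class.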

Results

In a numerical experiment with 9 clinically important features, the proposed MVI function achieved higher balanced accuracy than MICE for 7 of the 9 imputed variables (mean balanced accuracy 89.04% vs 80.75%), with comparable performance for the remaining variables. The SL ensemble for continuous variables showed an RMSE comparable to that of MICE (582 vs 597).

Conclusions

This study demonstrates that the SL-based imputation function improves accuracy over MICE in high-dimensional RWD while providing novel, observation-level uncertainty quantification.

Version published to 10.1101/2025.11.26.25341082 on medRxiv
Nov 28, 2025

Related articles

Missing Data in OHCA Registries: How Multiple Imputation Methods Affect Research Conclusions—Paper II

This article has 4 authors:
1. Stella Jinran Zhan
2. Seyed Ehsan Saffari
3. Marcus Eng Hock Ong
4. Fahad Javaid Siddiqui
Latest version Jan 16, 2026

Generative AI-Based Imputation to Preserve Data Fidelity and Enhance Outcome Prediction: A Multi-Institutional Study in Cardiac Surgery

This article has 11 authors:
1. Negin Maddah
2. Amin Ramezani
3. Qingchu Jin
4. Jakob Wollborn
5. Akinobu Itoh
6. Jaime B. Rabb
7. Felistas Mazhude
8. Robert S. Kramer
9. Douglas B. Sawyer
10. Raimond L. Winslow
11. Farhad R. Nezami
Latest version Jan 23, 2026

Evaluating Imputation Methods for Handling Missing Data in Complex Survey Designs: Evidence from the India DHS 2017–18

This article has 6 authors:
1. Mahfuzer Rohman
2. Md Sabbir Hossain
3. Md Fakrul Islam
4. Prosenjit Basak Arka
5. Md Rafi Hasan
6. Md Jamal Uddin
Latest version Jan 23, 2026
