ImputeBench: Benchmarking Single Imputation Methods

Robin Richter
Juliana F. Tavares
Anne Miloschewski
Monique M. B. Breteler
Sach Mukherjee

Abstract

Biomedical data often contain missing values, and in many applications missing value imputation (MVI) is an important part of the data analysis workflow. However, the performance of MVI methods depends on details of the joint distribution of data and missingness patterns that are typically unknown in practice, making an a priori choice of MVI method challenging. Furthermore, the technical assumptions underlying MVI methods can be hard to verify directly. Motivated by these issues, we propose an approach for the context-specific selection of MVI methods. Because different methods may work well in different cases, we argue for a move away from a “one size fits all” view and put forward a standardized, empirical approach in which MVI methods are benchmarked in the specific context of a problem of interest. We connect our work to the large body of MVI research, along the way refining definitions of missing at random and missing not at random and providing a detailed review of existing work on benchmarking. Our approach can be tailored to reflect specific assumptions on missingness patterns, allowing for application to diverse applied problems. In addition to using real data, we study benchmarking via data simulation spanning a broad range of properties, such as latent factors, non-linearity and multi-modality, with interpretable simulation parameters that are amenable to user specification. The approaches we propose can be used to (i) select an MVI method for a given data set or (ii) benchmark a novel MVI method across a range of regimes. Alongside the general protocol, we provide a specific, reproducible implementation (the R package ImputeBench, available at github.com/richterrob/ImputeBench) that gives users a ready-to-use tool for MVI selection and assessment. We illustrate the use of ImputeBench by studying the behaviour of a range of existing imputation methods (k-nn, soft impute, missForest, MICE) in the context of real data from an ongoing large-scale population-level study.
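The general idea of benchmarking imputation methods in a specific context can be illustrated with a minimal sketch: hide entries whose true values are known, impute them, and score each method on the hidden cells. The Python code below is not the ImputeBench API (the package is written in R) and is not the authors' protocol. All function names, the single-latent-factor simulation, the missing-completely-at-random masking, the simplified k-nearest-neighbour imputer and the RMSE score are assumptions chosen for illustration only.

```python
import math
import random


def simulate_latent_factor_data(n_rows, loadings, noise_sd, rng):
    """Hypothetical simulator: x_ij = loadings[j] * z_i + noise, one latent factor z_i."""
    data = []
    for _ in range(n_rows):
        z = rng.gauss(0.0, 1.0)
        data.append([l * z + rng.gauss(0.0, noise_sd) for l in loadings])
    return data


def mask_mcar(data, rate, rng):
    """Hide each entry independently with probability `rate` (an MCAR assumption)."""
    masked_data = [row[:] for row in data]
    masked_cells = []
    for i, row in enumerate(masked_data):
        for j in range(len(row)):
            if rng.random() < rate:
                row[j] = None
                masked_cells.append((i, j))
    return masked_data, masked_cells


def column_means(data):
    """Mean of the observed values in each column (0.0 if a column is empty)."""
    means = []
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def impute_mean(data):
    """Baseline: replace every missing entry with its column mean."""
    means = column_means(data)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in data]


def impute_knn(data, k=5):
    """Simplified k-nn imputer: average the k closest rows that observe the column."""
    means = column_means(data)

    def distance(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    completed = [row[:] for row in data]
    for i, row in enumerate(data):
        if None not in row:
            continue
        ranked = sorted((distance(row, other), idx)
                        for idx, other in enumerate(data) if idx != i)
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = [data[idx][j] for d, idx in ranked
                      if d < math.inf and data[idx][j] is not None][:k]
            # Fall back to the column mean when no comparable donor row exists.
            completed[i][j] = sum(donors) / len(donors) if donors else means[j]
    return completed


def rmse(truth, imputed, cells):
    """Root mean squared error restricted to the artificially hidden cells."""
    return math.sqrt(sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in cells) / len(cells))


def benchmark(data, methods, rate=0.2, repeats=5, seed=0):
    """Average each method's RMSE over several random masks of the same data."""
    rng = random.Random(seed)
    scores = {name: [] for name in methods}
    for _ in range(repeats):
        masked_data, cells = mask_mcar(data, rate, rng)
        if not cells:
            continue
        for name, impute in methods.items():
            scores[name].append(rmse(data, impute(masked_data), cells))
    return {name: sum(v) / len(v) for name, v in scores.items() if v}


rng = random.Random(42)
data = simulate_latent_factor_data(200, [1.0, -1.0, 0.5, 2.0, 1.5], 0.1, rng)
results = benchmark(data, {"mean": impute_mean, "knn": impute_knn})
for name, score in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {score:.3f}")
```

Swapping `mask_mcar` for a masking function that encodes a different missingness assumption, or replacing the simulator with a complete subset of real data, changes the context being benchmarked while the scoring loop stays the same.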

Version published to 10.1101/2025.02.02.25321536 on medRxiv
Feb 3, 2025

Related articles

Missing Data in OHCA Registries: How Multiple Imputation Methods Affect Research Conclusions—Paper II

This article has 4 authors:
1. Stella Jinran Zhan
2. Seyed Ehsan Saffari
3. Marcus Eng Hock Ong
4. Fahad Javaid Siddiqui
This article has no evaluations. Latest version: Jan 16, 2026

Evaluating Imputation Methods for Handling Missing Data in Complex Survey Designs: Evidence from the India DHS 2017–18

This article has 6 authors:
1. Mahfuzer Rohman
2. Md Sabbir Hossain
3. Md Fakrul Islam
4. Prosenjit Basak Arka
5. Md Rafi Hasan
6. Md Jamal Uddin
This article has no evaluations. Latest version: Jan 23, 2026

Ten Quick Tips for Biomedical Federated Learning

This article has 8 authors:
1. Kyle Ellrott
2. Venkat S. Maladi
3. Jean-Christophe Bélisle-Pipon
4. Emek Demir
5. Yael Bensoussan
6. Serghei Mangul
7. Alex A. T. Bui
8. Paul C. Boutros
This article has no evaluations. Latest version: Jan 27, 2026
