ImputeBench: Benchmarking Single Imputation Methods
Abstract
Biomedical data often contain missing values, and in many applications missing value imputation (MVI) is an important part of the data analysis workflow. However, the performance of MVI methods depends on details of the joint distribution of data and missingness patterns that are typically unknown in practice, making an a priori choice of MVI method challenging. Furthermore, technical assumptions underlying MVI methods can be hard to verify directly in practice. Motivated by these issues, we propose an approach for the context-specific selection of MVI methods. Because different methods may work well in different cases, we argue for a move away from a “one size fits all” view and put forward a standardized, empirical approach in which MVI methods are benchmarked in the specific context of a problem of interest. We connect our work to the large body of MVI research, along the way refining definitions of missing at random and missing not at random and providing a detailed review of existing work on benchmarking. Our approach can be tailored to reflect specific assumptions on missingness patterns, allowing for application in diverse applied problems. In addition to using real data, we study benchmarking via data simulation spanning a broad range of properties, such as latent factors, non-linearity and multi-modality, with interpretable simulation parameters that are amenable to user specification. The approaches we propose can be used to (i) select an MVI method for a given data set or (ii) benchmark a novel MVI method across a range of regimes. Alongside the general protocol, we provide a specific, reproducible implementation (the R package ImputeBench, available at github.com/richterrob/ImputeBench) that gives users a ready-to-use tool for MVI selection and assessment.
We illustrate the use of ImputeBench to study the behaviour of a range of existing imputation methods (k-nn, soft impute, missForest, MICE) in the context of real data from an ongoing large-scale population-level study.
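The general idea of benchmarking imputation methods on a specific data set can be sketched as follows: hide some of the observed entries, impute them with each candidate method, and score the imputations against the hidden values. This is a minimal illustration in Python and not the ImputeBench API (the package itself is written in R). All function names here are hypothetical. It assumes entries are hidden uniformly at random, and it uses column-mean imputation and a simple k-nearest-neighbour imputer as stand-ins for the methods studied in the paper.

```python
import math
import random


def mask_observed(data, frac, rng):
    """Hide a random fraction of observed cells (assumption: uniform masking)."""
    observed = [(i, j) for i, row in enumerate(data)
                for j, v in enumerate(row) if v is not None]
    hidden = rng.sample(observed, max(1, int(frac * len(observed))))
    masked = [list(row) for row in data]
    truth = {}
    for i, j in hidden:
        truth[(i, j)] = masked[i][j]
        masked[i][j] = None
    return masked, truth


def column_means(data):
    """Mean of the observed values in each column (0.0 if none observed)."""
    means = []
    for j in range(len(data[0])):
        vals = [row[j] for row in data if row[j] is not None]
        means.append(sum(vals) / len(vals) if vals else 0.0)
    return means


def impute_mean(data):
    """Replace each missing cell by its column mean."""
    means = column_means(data)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in data]


def impute_knn(data, k=3):
    """Replace each missing cell by the average over the k nearest donor rows."""
    means = column_means(data)
    out = [list(row) for row in data]
    for i, row in enumerate(data):
        for j, v in enumerate(row):
            if v is not None:
                continue
            candidates = []
            for r, other in enumerate(data):
                if r == i or other[j] is None:
                    continue
                # Distance uses only columns observed in both rows.
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum(shared) / len(shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [val for _, val in candidates[:k]]
            # Fall back to the column mean when no donor row is usable.
            out[i][j] = sum(nearest) / len(nearest) if nearest else means[j]
    return out


def rmse(imputed, truth):
    """Root mean squared error on the hidden cells only."""
    sq = [(imputed[i][j] - t) ** 2 for (i, j), t in truth.items()]
    return math.sqrt(sum(sq) / len(sq))


def benchmark(data, methods, frac=0.2, repeats=20, seed=0):
    """Average RMSE of each method over repeated random maskings."""
    rng = random.Random(seed)
    scores = {name: [] for name in methods}
    for _ in range(repeats):
        masked, truth = mask_observed(data, frac, rng)
        for name, fn in methods.items():
            scores[name].append(rmse(fn(masked), truth))
    return {name: sum(s) / len(s) for name, s in scores.items()}


# Toy data with three strongly correlated columns (purely illustrative).
toy_rng = random.Random(1)
toy = []
for _ in range(60):
    x = toy_rng.gauss(0, 1)
    toy.append([x, 2 * x + toy_rng.gauss(0, 0.1), -x + toy_rng.gauss(0, 0.1)])

results = benchmark(toy, {"mean": impute_mean, "knn": impute_knn})
print(results)
```

On data like this, where columns are strongly correlated, a neighbour-based method would be expected to score better than column means. The ranking can change with the data and the missingness pattern, which is the reason for benchmarking in context.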