Mind the Gaps: Guess Less, Predict More with Missing Medical Data
Abstract
Healthcare data, typically available as electronic health records (EHR), provide a rich profile of an individual’s health and lifestyle. These data can be harnessed for predictive modelling with machine-learning approaches. A common challenge in using EHR data is the prevalence of missing information. Missingness arises through three primary mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The typical way to handle missingness during predictive modelling is imputation, which can be statistical or learning-based. Imputation, however, risks changing the original distribution of the data attributes. This can cause serious problems, as even small changes in healthcare data can reduce the accuracy of model predictions. As a result, alternative modelling-based approaches have been explored. In this study, we evaluate four machine learning models on three datasets with different missingness types and levels while keeping the original data distribution unchanged. We provide a thorough comparison of these modelling techniques, with insights into how they offer robust solutions for handling missing data in healthcare applications. This work informs strategies for assessing various levels and patterns of missing data and for integrating their handling into the machine learning pipeline.
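
To make the three missingness mechanisms concrete, the following minimal Python sketch injects missing values into one column of a toy table under each mechanism. It is an illustration under simplifying assumptions, not the procedure used in this study: the function `inject_missing`, its parameters, and the rule of masking the top-ranked rows for MAR and MNAR are all hypothetical choices made for clarity.

```python
import math
import random


def inject_missing(rows, mechanism, rate=0.3, target=0, driver=1, seed=0):
    """Return a copy of `rows` with NaN injected into column `target`.

    Hypothetical helper for illustration only. `rows` is a list of
    equal-length lists of floats; `driver` is a fully observed column.
    """
    rng = random.Random(seed)
    out = [list(r) for r in rows]      # copy so the input stays unchanged
    n = len(out)
    k = int(round(rate * n))           # number of values to remove

    if mechanism == "MCAR":
        # Every row is equally likely to lose its value.
        hit = set(rng.sample(range(n), k))
    elif mechanism == "MAR":
        # Missingness depends only on another, observed column:
        # here, rows with the largest driver values lose the target.
        order = sorted(range(n), key=lambda i: out[i][driver], reverse=True)
        hit = set(order[:k])
    elif mechanism == "MNAR":
        # Missingness depends on the value that goes missing:
        # here, the largest target values are the ones removed.
        order = sorted(range(n), key=lambda i: out[i][target], reverse=True)
        hit = set(order[:k])
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")

    for i in hit:
        out[i][target] = math.nan
    return out


# Toy data: two independent Gaussian columns (assumed, not real EHR data).
data_rng = random.Random(42)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(1000)]

for mechanism in ("MCAR", "MAR", "MNAR"):
    masked = inject_missing(rows, mechanism)
    observed = [r[0] for r in masked if not math.isnan(r[0])]
    frac_missing = 1 - len(observed) / len(masked)
    mean_observed = sum(observed) / len(observed)
    print(mechanism, round(frac_missing, 2), round(mean_observed, 3))
```

Comparing the mean of the observed values across the three runs shows why the mechanism matters: under MNAR the remaining values are a systematically shifted sample of the column, so any method that fills gaps from the observed distribution inherits that shift.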