Mission imputable: Effects of missing data processing on infectious disease detection and prognosis
Abstract
Background
Missing data in medical datasets poses significant challenges for developing effective AI/ML pipelines. Inaccurate imputation can lead to biased results, reduced model performance, and compromised clinical insights. Understanding how different imputation methods affect AI/ML model performance is crucial for ensuring accurate clinical findings.
Objective
This study systematically investigates the effects of different imputation methods on AI/ML model performance and the clinical implications of these methods.
Methods
We investigate the impact of four missing data strategies on the performance of common classification algorithms applied to medical data. We evaluate performance using sensitivity and specificity on two tasks: predicting COVID-19 diagnosis and predicting patient deterioration. We also perform feature analysis to understand the clinical implications of the choice of imputation method.
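For reference, the two evaluation metrics can be computed from binary predictions as in the minimal Python sketch below. The function name `sensitivity_specificity` and the toy labels are illustrative assumptions and are not taken from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    # Sensitivity: share of actual positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of actual negatives that were correctly ruled out.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for illustration only (not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reporting both metrics, rather than accuracy alone, matters when positive cases are rare, since a model can reach high accuracy while missing most of them.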
Results
Our findings show that the choice of imputation method significantly affects the performance of AI/ML techniques and the clinical conclusions drawn from the data. The optimal handling of missing values depends on (i) the composition of the features with missing values, (ii) the rate of missing values, and (iii) the pattern of missingness across features. Using COVID-19 diagnosis and patient deterioration as representative clinical tasks, our results indicate that MICE (Multiple Imputation by Chained Equations) yields the best overall performance, with a 26% improvement in accuracy compared to baseline methods. Specifically, for predicting COVID-19 diagnosis, we achieved a sensitivity of 81% and a specificity of 98%, while for patient deterioration, the sensitivity was 65% and the specificity was 99%.
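The rate and pattern of missing values named above can be summarized before an imputation method is chosen. The Python sketch below is one possible way to do so, not the study's procedure. The helper names, the feature names (`crp`, `lymphocytes`, `spo2`), and the records are all hypothetical.

```python
from collections import Counter


def missing_rates(rows, features):
    """Fraction of records in which each feature is missing (None)."""
    n = len(rows)
    if n == 0:
        return {f: 0.0 for f in features}
    return {f: sum(1 for r in rows if r.get(f) is None) / n for f in features}


def missing_patterns(rows, features):
    """Count how often each combination of missing features occurs."""
    return Counter(
        tuple(f for f in features if r.get(f) is None) for r in rows
    )


# Hypothetical patient records; None marks a missing measurement.
features = ["crp", "lymphocytes", "spo2"]
rows = [
    {"crp": 12.0, "lymphocytes": None, "spo2": 96},
    {"crp": None, "lymphocytes": None, "spo2": 93},
    {"crp": 48.5, "lymphocytes": 1.1, "spo2": 91},
    {"crp": 7.3, "lymphocytes": None, "spo2": None},
]

rates = missing_rates(rows, features)
patterns = missing_patterns(rows, features)
print(rates)
print(patterns.most_common())
```

A feature with a high missing rate, or one that tends to be missing together with others, may warrant different handling than a feature with sparse, isolated gaps.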
Conclusion
This study demonstrates the critical impact of missing data imputation on AI/ML model performance and the clinical insights derived from these models. Our findings underscore the importance of selecting appropriate imputation techniques tailored to the specific characteristics of medical data to ensure accurate and reliable AI/ML predictions.