Mission imputable: Effects of missing data processing on infectious disease detection and prognosis

Suravi Saha Roy
Ngoc Thi Nguyen
Agustin Zuniga
Fatemeh Sarhaddi
Eemil Lagerspetz
Huber Flores
Petteri Nurmi

Abstract

Background

Missing data in medical datasets poses significant challenges for developing effective AI/ML pipelines. Inaccurate imputation can lead to biased results, reduced model performance, and compromised clinical insights. Understanding how different imputation methods affect AI/ML model performance is crucial for ensuring accurate clinical findings.

Objective

This study systematically investigates the effects of different imputation methods on AI/ML model performance and the clinical implications of these methods.

Methods

We investigate the impact of four missing-data strategies on the performance of common classification algorithms applied to medical data. Performance is evaluated using sensitivity and specificity on two tasks: predicting COVID-19 diagnosis and predicting patient deterioration. We also perform a feature analysis to understand the clinical implications of the choice of imputation method.
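
As a rough illustration of the two evaluation metrics named above, sensitivity and specificity can be computed from binary labels as sketched below. This is a minimal sketch and not the authors' evaluation code; the function name `sensitivity_specificity` and the toy labels are assumptions made purely for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = positive."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Sensitivity: share of actual positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of actual negatives that were correctly ruled out.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy labels (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reporting both metrics, rather than accuracy alone, matters when one class is rare: a model can reach high accuracy on an imbalanced dataset while still missing most positive cases.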

Results

Our findings show that the choice of imputation method significantly affects the performance of AI/ML techniques and the clinical conclusions drawn from the data. The optimal handling of missing values depends on (i) the composition of the features with missing values, (ii) the rate of missing values, and (iii) the pattern of the missing features. Using COVID-19 diagnosis and patient deterioration as representative examples of clinical tasks, our results indicate that MICE imputation yields the best overall performance, resulting in a 26% improvement in accuracy compared to baseline methods. Specifically, for predicting COVID-19 diagnosis, we achieved a sensitivity of 81% and specificity of 98%, while for patient deterioration, the sensitivity was 65% and specificity was 99%.
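
The factors listed above (which features have missing values, how often, and in what pattern) can be inspected before an imputation method is chosen. The sketch below is illustrative only: the helper `missingness_summary`, the feature names, and the toy records are hypothetical and do not come from the study's data. It treats a value as missing when it is `None`.

```python
from collections import Counter


def missingness_summary(rows, features):
    """Return per-feature missing rates and counts of row-level missingness patterns.

    A pattern is the tuple of feature names that are missing in a given row;
    the empty tuple means the row is complete.
    """
    n = len(rows)
    if n == 0:
        raise ValueError("rows must not be empty")

    # Rate of missing values for each feature.
    rates = {f: sum(1 for r in rows if r.get(f) is None) / n for f in features}

    # How often each combination of missing features occurs.
    patterns = Counter(
        tuple(f for f in features if r.get(f) is None) for r in rows
    )
    return rates, patterns


# Hypothetical feature names and records, for illustration only.
features = ["crp", "lymphocytes", "spo2"]
rows = [
    {"crp": 12.0, "lymphocytes": 1.1, "spo2": 96},
    {"crp": None, "lymphocytes": 0.8, "spo2": 93},
    {"crp": None, "lymphocytes": None, "spo2": 91},
    {"crp": 40.5, "lymphocytes": 0.6, "spo2": None},
]

rates, patterns = missingness_summary(rows, features)
for feature, rate in rates.items():
    print(f"{feature}: {rate:.0%} missing")
for pattern, count in patterns.items():
    print(pattern or "(complete)", count)
```

A summary of this kind only describes the gaps; it does not by itself indicate which imputation method is appropriate.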

Conclusion

This study demonstrates the critical impact of missing data imputation on AI/ML model performance and the clinical insights derived from these models. Our findings underscore the importance of selecting appropriate imputation techniques tailored to the specific characteristics of medical data to ensure accurate and reliable AI/ML predictions.

Version published to 10.1101/2025.02.15.25322351v1 on medRxiv
Feb 16, 2025

Related articles

Mind the Gaps: Guess Less, Predict More with Missing Medical Data

This article has 2 authors:
1. Aashish Bhandari
2. Sonika Tyagi
This article has no evaluations
Latest version Mar 17, 2025

ImputeBench: Benchmarking Single Imputation Methods

This article has 5 authors:
1. Robin Richter
2. Juliana F. Tavares
3. Anne Miloschewski
4. Monique M. B. Breteler
5. Sach Mukherjee
This article has no evaluations
Latest version Feb 3, 2025

The effect of population selection criteria on model estimates and data missingness in electronic health record studies

This article has 11 authors:
1. Emma Pritchard
2. Karina-Doris Vihta
3. Koen B. Pouwels
4. Samuel Lipworth
5. Russell Hope
6. Berit Muller-Pebody
7. T. Phuong Quan
8. Jack Cregan
9. Susan Hopkins
10. David W. Eyre
11. A. Sarah Walker
This article has no evaluations
Latest version Feb 12, 2025
