To impute or not to impute in untargeted metabolomics - that is the compositional question
Abstract
Untargeted metabolomics often produces large datasets with missing values, arising from biological or technical factors, which can undermine statistical analyses and lead to biased biological interpretations. Imputation methods such as k-Nearest Neighbors (kNN) and Random Forest (RF) regression are commonly used, but their effects vary depending on the type of missing data, e.g. Missing Completely At Random (MCAR) or Missing Not At Random (MNAR). Here, we determined the impact of the degree and type of missing data on the accuracy of kNN and RF imputation using two datasets: a targeted metabolomic dataset with spiked-in standards and an untargeted metabolomic dataset. We also assessed the effect of compositional data analysis (CoDA) approaches, such as the centered log-ratio (CLR) transform, on data interpretation, since these methods are increasingly used in metabolomics.
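To make the link between compositional transforms and missing values concrete, the following is a minimal sketch of the CLR transform for a single sample. It is an illustration only, not the pipeline used in this study; the function name `clr` and the example intensities are hypothetical. Because the transform takes logarithms, it is undefined for zero or missing parts, which is one reason missing values must be handled before a CoDA approach can be applied.

```python
import math

def clr(composition):
    """Centered log-ratio transform of one sample (all parts must be > 0)."""
    if any(x <= 0 for x in composition):
        # log(0) is undefined, so zeros / missing values must be dealt with first
        raise ValueError("CLR is undefined for zero or negative parts")
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]

# Hypothetical feature intensities for one sample
sample = [120.0, 30.0, 7.5, 480.0]
clr_values = clr(sample)

# Multiplying every part by a constant leaves the CLR values unchanged
scaled = clr([10 * x for x in sample])
```

CLR values for a sample sum to zero and do not change when all intensities are multiplied by the same factor, so only the ratios between features carry information.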
Overall, we found that kNN and RF imputed more accurately when the proportion of missing data across samples for a metabolic feature was low. However, these methods could not handle MNAR data: they generated wildly inflated values or imputed values where none should exist. Furthermore, we show that the proportion of missing values had a strong impact on the accuracy of imputation, which in turn affected the interpretation of the results. Our results suggest that extreme caution should be used with imputation, even with modest levels of missing data or when the type of missingness is unknown.
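The difference between MCAR and MNAR can be illustrated with a small simulation. This sketch is not the analysis performed in this study: it uses simulated log-normal intensities and a plain observed-mean fill as a deliberately simplified stand-in for kNN or RF, and all function names are hypothetical. It assumes MNAR arises from a detection limit, so the lowest values are the ones that go missing.

```python
import random
import statistics

def mask_mcar(values, fraction, rng):
    """Hide a random subset of values, independent of magnitude (MCAR)."""
    n_missing = int(round(fraction * len(values)))
    hidden = set(rng.sample(range(len(values)), n_missing))
    return [None if i in hidden else v for i, v in enumerate(values)]

def mask_mnar(values, fraction):
    """Hide the lowest values, mimicking a detection limit (MNAR)."""
    n_missing = int(round(fraction * len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i])
    hidden = set(order[:n_missing])
    return [None if i in hidden else v for i, v in enumerate(values)]

def mean_impute_bias(truth, masked):
    """Mean of (imputed - true) over hidden entries, filling with the observed mean."""
    observed = [v for v in masked if v is not None]
    fill = statistics.mean(observed)  # simplified stand-in imputer, not kNN/RF
    errors = [fill - t for t, m in zip(truth, masked) if m is None]
    return statistics.mean(errors)

rng = random.Random(0)
# Simulated intensities for one feature across 200 samples (assumed distribution)
truth = [rng.lognormvariate(5.0, 1.0) for _ in range(200)]

masked_mcar = mask_mcar(truth, 0.3, rng)
masked_mnar = mask_mnar(truth, 0.3)

bias_mcar = mean_impute_bias(truth, masked_mcar)
bias_mnar = mean_impute_bias(truth, masked_mnar)
print(f"MCAR bias: {bias_mcar:.1f}  MNAR bias: {bias_mnar:.1f}")
```

Under this MNAR scheme every hidden value lies below every observed value, so any fill drawn from the observed values is necessarily too high. The same reasoning suggests why imputers that borrow from observed intensities may inflate values that were missing because they fell below detection.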