To impute or not to impute in untargeted metabolomics - that is the compositional question

Dennis Dimitri Krutkin
Sydney Thomas
Simone Zuffa
Prajit Rajkumar
Rob Knight
Pieter C. Dorrestein
Scott T. Kelley

Abstract

Untargeted metabolomics often produces large datasets with missing values, arising from biological or technical factors, which can undermine statistical analyses and lead to biased biological interpretations. Imputation methods, such as k-Nearest Neighbors (kNN) and Random Forest (RF) regression, are commonly used, but their effects vary depending on the type of missing data, e.g., Missing Completely At Random (MCAR) or Missing Not At Random (MNAR). Here, we determined the impacts of the degree and type of missing data on the accuracy of kNN and RF imputation using two datasets: a targeted metabolomic dataset with spiked-in standards and an untargeted metabolomic dataset. We also assessed the effect of compositional data analysis (CoDA) approaches, such as the centered log-ratio (CLR) transform, on data interpretation, since these methods are increasingly being used in metabolomics.
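
For readers unfamiliar with the CLR transform mentioned above, the following is a minimal sketch and not the authors' code. The function name `clr` and the example intensities are assumptions chosen for illustration. The transform subtracts the mean of the log values from each log value, so it is undefined for zero or missing parts, which is one reason missing-value handling and compositional approaches interact.

```python
import math

def clr(composition):
    """Centered log-ratio transform of one sample.

    Hypothetical helper for illustration: each part is replaced by
    ln(part) minus the mean of ln over all parts in the sample.
    """
    if any(v <= 0 for v in composition):
        # Logs of zero or negative values do not exist, so zeros and
        # missing values must be handled before the transform.
        raise ValueError("CLR is undefined for zero or negative parts")
    logs = [math.log(v) for v in composition]
    mean_log = sum(logs) / len(logs)
    return [lv - mean_log for lv in logs]

# Made-up feature intensities for a single sample.
sample = [10.0, 100.0, 1000.0]
transformed = clr(sample)
```

The transformed values of a sample always sum to zero, and a part equal to the geometric mean of the sample maps to zero.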

Overall, we found that kNN and RF performed more accurately when the proportion of missing data across samples for a metabolic feature was low. However, these imputations could not handle MNAR data and generated wildly inflated values or imputed values where none should exist. Furthermore, we show that the proportion of missing values had a strong impact on the accuracy of imputation, which in turn affected the interpretation of the results. Our results suggest that extreme caution should be used with imputation, even with modest levels of missing data or when the type of missingness is unknown.
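
The contrast between MCAR and MNAR described above can be illustrated with a toy sketch. This is not the authors' pipeline or data: `knn_impute`, the six made-up samples, and the detection limit of 10 are all assumptions, and the hand-rolled nearest-neighbor averaging merely stands in for a real kNN imputer. Under MNAR, values are missing because they fall below the detection limit, so every observed neighbor is high and the imputed value is inflated.

```python
def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of that column over the
    k nearest complete rows (Euclidean distance on the other columns).
    Hypothetical helper for illustration only."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(list(r))
            continue

        def dist(other, r=r):
            # Distance ignores the target column, which is missing in r.
            return sum((a - b) ** 2
                       for i, (a, b) in enumerate(zip(r, other))
                       if i != target) ** 0.5

        nearest = sorted(complete, key=dist)[:k]
        new_row = list(r)
        new_row[target] = sum(n[target] for n in nearest) / len(nearest)
        filled.append(new_row)
    return filled

# Made-up truth: column 0 is a fully observed covariate,
# column 1 is a metabolite intensity.
truth = [(1.0, 2.0), (2.0, 4.0), (3.0, 50.0),
         (4.0, 60.0), (5.0, 70.0), (6.0, 80.0)]

# MNAR: intensities below an assumed detection limit of 10 go missing.
LIMIT = 10.0
mnar = [(x, y if y >= LIMIT else None) for x, y in truth]
mnar_filled = knn_impute(mnar, target=1)

# MCAR: one arbitrary entry (index 4) goes missing regardless of its value.
mcar = [(x, None if i == 4 else y) for i, (x, y) in enumerate(truth)]
mcar_filled = knn_impute(mcar, target=1)

print("MNAR truth 2.0 -> imputed", mnar_filled[0][1])
print("MCAR truth 70.0 -> imputed", mcar_filled[4][1])
```

In this toy case the MCAR entry is recovered from neighbors on either side of it, while the MNAR entries are filled from samples that all lie above the detection limit, so the imputed intensity far exceeds the true low value.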

Version published to 10.1101/2024.10.28.620738 on bioRxiv
Nov 2, 2024

Related articles

Missing Data in OHCA Registries: How Multiple Imputation Methods Affect Research Conclusions—Paper II

This article has 4 authors:
1. Stella Jinran Zhan
2. Seyed Ehsan Saffari
3. Marcus Eng Hock Ong
4. Fahad Javaid Siddiqui
This article has no evaluations
Latest version Jan 16, 2026

Evaluating Imputation Methods for Handling Missing Data in Complex Survey Designs: Evidence from the India DHS 2017–18

This article has 6 authors:
1. Mahfuzer Rohman
2. Md Sabbir Hossain
3. Md Fakrul Islam
4. Prosenjit Basak Arka
5. Md Rafi Hasan
6. Md Jamal Uddin
This article has no evaluations
Latest version Jan 23, 2026

Bayesian Network Structure Learning from Incomplete Breast Cancer Data Using Structural Expectation–Maximization

This article has 3 authors:
1. Navaee Lavasani Monireh
2. Rezaeitabar Vahid
3. Khayamzadeh Maryam
This article has no evaluations
Latest version Dec 10, 2025
