A unified framework for batch correction and missing data handling in large-scale and single-cell mass spectrometry proteomics
Abstract
Large-scale mass spectrometry (MS)-based proteomics, including single-cell proteomics, is routinely affected by technical variation arising from discrete batch effects, inter-laboratory differences and continuous signal drift during data acquisition. Current correction strategies typically address these sources of unwanted variation independently and often require either removal of proteins with missing values or imputation before correction, both of which may lead to information loss and potential amplification of technical bias. Here we present NMFBatch, a unified statistical framework that simultaneously models discrete and continuous unwanted variation in bulk and single-cell proteomics data. NMFBatch integrates non-negative matrix factorization with generalized additive modelling and directly accommodates missing values, thereby enabling both on-the-fly imputation during correction and optional post-correction imputation. Benchmarking against six batch-correction methods using multi-laboratory reference datasets and a large plasma proteomics cohort shows that NMFBatch consistently reduces batch-associated variation while preserving biological structure under both balanced and confounded experimental designs. Application to single-cell proteomics data further shows effective reduction of TMT- and acquisition-associated variation while retaining biologically meaningful clustering. Together, these results establish NMFBatch as a flexible framework for modelling unwanted variation in proteomics experiments, with potential applications in cross-cohort harmonization and integrative proteomics analysis.
Graphical Abstract
Created in BioRender. Youssef, A. (2026) https://BioRender.com/c1q1yxt