A unified framework for batch correction and missing data handling in large-scale and single-cell mass spectrometry proteomics

Ali Mostafa Anwar
Salma Bayoumi
Leo Lahti
Eleanor Coffey

Abstract

Large-scale mass spectrometry (MS)-based proteomics, including single-cell proteomics, is routinely affected by technical variation arising from discrete batch effects, inter-laboratory differences and continuous signal drift during data acquisition. Current correction strategies typically address these sources of unwanted variation independently and often require either removing proteins with missing values or imputing them before correction, both of which can lose information and may amplify technical bias. Here we present NMFBatch, a unified statistical framework that simultaneously models discrete and continuous unwanted variation in bulk and single-cell proteomics data. NMFBatch integrates non-negative matrix factorization with generalized additive modelling and directly accommodates missing values, thereby enabling both on-the-fly imputation during correction and optional post-correction imputation. Benchmarking against six batch-correction methods, using multi-laboratory reference datasets and a large plasma proteomics cohort, shows that NMFBatch consistently reduces batch-associated variation while preserving biological structure under both balanced and confounded experimental designs. Application to single-cell proteomics data further shows effective reduction of TMT- and acquisition-associated variation while retaining biologically meaningful clustering. Together, these results establish NMFBatch as a flexible framework for modelling unwanted variation in proteomics experiments, with potential applications in cross-cohort harmonization and integrative proteomics analysis.
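
To make concrete the idea of a factorization that accommodates missing values directly, the following is a minimal sketch of a generic masked non-negative matrix factorization with multiplicative updates. It is not the NMFBatch implementation, which the abstract does not describe in detail, and it omits the generalized additive modelling of drift entirely. The function names (`masked_nmf`, `observed_rmse`), the toy matrix, the rank and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np


def masked_nmf(X, rank, n_iter=300, seed=0, eps=1e-9):
    """Fit X ~ W @ H on observed (non-NaN) entries only.

    Hypothetical helper: generic weighted multiplicative updates,
    not the authors' algorithm.
    """
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(X)              # True where an intensity was measured
    X0 = np.where(mask, X, 0.0)      # missing cells contribute nothing to the fit
    n_rows, n_cols = X.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        WH = (W @ H) * mask          # reconstruction restricted to observed cells
        H *= (W.T @ X0) / (W.T @ WH + eps)
        WH = (W @ H) * mask
        W *= (X0 @ H.T) / (WH @ H.T + eps)
    return W, H


def observed_rmse(X, W, H):
    """Root-mean-square error over the observed entries of X."""
    mask = ~np.isnan(X)
    return float(np.sqrt(np.mean((X[mask] - (W @ H)[mask]) ** 2)))


# Toy data (assumed): a non-negative rank-3 matrix with about 20% of values missing.
rng = np.random.default_rng(1)
X_true = rng.random((30, 3)) @ rng.random((3, 12))
X_missing = X_true.copy()
X_missing[rng.random(X_true.shape) < 0.2] = np.nan

W, H = masked_nmf(X_missing, rank=3)
observed = ~np.isnan(X_missing)
# Keep measured values as they are; fill only the gaps from the low-rank model.
X_filled = np.where(observed, X_missing, W @ H)
```

Because the updates only ever see observed cells, no protein has to be dropped or imputed beforehand, and the fitted product `W @ H` supplies values for the gaps as a by-product of the fit. That is the sense in which imputation can happen "on the fly" in a scheme of this kind.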

Graphical Abstract

Created in BioRender. Youssef, A. (2026) https://BioRender.com/c1q1yxt

Version published to 10.64898/2026.05.19.726178 on bioRxiv
May 21, 2026

Related articles

BatchVaria: a variance-aware framework for evaluating batch correction in high-dimensional omics data

This article has 3 authors:
1. Nicholas Moir
2. Kitty Sherwood
3. T. Ian Simpson
This article has no evaluations
Latest version May 12, 2026

reComBat-seq: Regularized negative binomial regression for batch-effect correction in underdetermined transcriptomics datasets

This article has 3 authors:
1. Zhasmina Stoyanova
2. Jörg Menche
3. Daniel Malzl
This article has no evaluations
Latest version May 30, 2026

JUMPlion improves quantitative DIA proteomics through ion-level recovery of missing values

This article has 7 authors:
1. Yingxue Fu
2. Zuo-Fei Yuan
3. Stephanie D. Byrum
4. Long Wu
5. Junmin Peng
6. Xusheng Wang
7. Anthony A. High
This article has no evaluations
Latest version May 1, 2026
