Using regularized regression and biological covariation to impute missing values in quantitative proteomics


Abstract

Proteomics studies analyzing many samples typically generate datasets with missing values, in which many protein abundances are quantified in only a subset of assayed conditions. While multiplexing with isobaric tags can address this by combining multiple samples into a single injection, missing values are unavoidable when the sample count exceeds the number of available isobaric tags (currently >35). Such missing data complicate the interpretation of large-scale studies across diverse experimental conditions. Here, we introduce a method to impute missing values of relative protein abundance by leveraging measurements from other proteins in the dataset through regularized regression. Our technique, which is applicable to diverse datasets including different cell lines, animals, or biochemical perturbations, capitalizes on the hitherto overlooked biological covariation among protein abundance changes. Our analysis of eight published proteomics datasets reveals a robust imputation capability, achieving a median R² of 0.55 to 0.8 between imputed and measured data. We demonstrate similar imputation efficacy across multiple measurement modalities: TMT, DIA, label-free, and TMT phosphoproteomics. When examining the regression coefficients that were pivotal for accurate imputation, we found that they often mirror known biology. We propose that this previously overlooked biological covariation might lead to the generation of novel hypotheses and ultimately advance our understanding of systems-level protein organization.
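The general idea of imputing one protein from the others can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which regularizer is used, so ridge (L2) regression is assumed here, and the function names `ridge_fit` and `impute_with_ridge`, the penalty `alpha`, the choice to use only fully quantified proteins as predictors, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def ridge_fit(A, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept (assumed regularizer)."""
    a_mean = A.mean(axis=0)
    y_mean = y.mean()
    Ac = A - a_mean
    yc = y - y_mean
    # Solve (Ac^T Ac + alpha * I) w = Ac^T yc for the coefficient vector w.
    w = np.linalg.solve(Ac.T @ Ac + alpha * np.eye(A.shape[1]), Ac.T @ yc)
    b = y_mean - a_mean @ w
    return w, b


def impute_with_ridge(X, alpha=1.0):
    """Fill NaNs in a samples x proteins matrix of relative abundances.

    Each protein with missing values is regressed on the proteins that are
    quantified in every sample (a simplifying assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    complete_cols = np.flatnonzero(~missing.any(axis=0))
    for j in np.flatnonzero(missing.any(axis=0)):
        observed = ~missing[:, j]
        if observed.sum() < 2 or complete_cols.size == 0:
            continue  # too little data to fit; leave the gaps untouched
        A = X[:, complete_cols]
        w, b = ridge_fit(A[observed], X[observed, j], alpha)
        filled[~observed, j] = A[~observed] @ w + b
    return filled


# Synthetic demonstration: protein 5 covaries with proteins 0 and 1.
rng = np.random.default_rng(0)
truth = rng.normal(size=(40, 6))
truth[:, 5] = 0.8 * truth[:, 0] - 0.5 * truth[:, 1] + 0.05 * rng.normal(size=40)

X = truth.copy()
held_out = np.arange(0, 40, 5)  # samples in which protein 5 is "not quantified"
X[held_out, 5] = np.nan

filled = impute_with_ridge(X, alpha=1.0)
r = np.corrcoef(filled[held_out, 5], truth[held_out, 5])[0, 1]
print(f"R^2 between imputed and held-out values: {r ** 2:.3f}")
```

Masking measured values and comparing them with their imputed counterparts, as in the last lines, is one simple way to obtain an R² of the kind the abstract reports. The fitted coefficients `w` indicate which predictor proteins drive each imputation, which is the quantity the abstract describes as often mirroring known biology.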
