Protein Compositional Ratio Representation (PCRR) Systematically Improves Human Disease Prediction

Abstract

Plasma proteomics captures a functional snapshot of human physiology, yet most machine learning models treat protein abundances as independent variables, ignoring the fact that biological systems and proteomic measurements are inherently compositional. Many molecular processes depend not on absolute concentrations but on relative balances: receptor–ligand stoichiometry, enzyme–substrate ratios, and homeostatic feedback loops that govern signaling and metabolism. We propose that these relationships are best captured through pairwise protein ratios, which reflect underlying biochemical constraints more faithfully than raw expression values.

We evaluate a machine learning framework that uses pairwise log-ratios of proteins, log(A) − log(B), as features, thereby encoding compositional structure directly into the feature space. Applied to the ROSMAP plasma proteomics cohort (n = 871), this approach substantially improved the classification of Alzheimer’s subtypes (NCI, MCI, AD, AD+), with an average AUROC gain of +0.0995 over a strong baseline that incorporated raw proteomics and demographics. The top-ranked ratios (e.g., APOC1:ARID1A, FGF7:LBP) captured converging pathogenic pillars of Alzheimer’s disease, including microglial activation, proteostasis dysregulation, and lipid-clearance imbalance, indicating that ratio-based features recover biologically coherent axes of disease.
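The pairwise log-ratio construction can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `pairwise_log_ratios`, the toy abundance matrix, and the placeholder protein names `P1`, `P2`, `P3` are all assumptions introduced here, and the abstract does not describe how zeros or missing values are handled.

```python
import numpy as np
from itertools import combinations

def pairwise_log_ratios(abundances, names):
    """Build log(A) - log(B) features for every unordered protein pair.

    abundances: (n_samples, n_proteins) array of strictly positive values.
    names: list of n_proteins labels (hypothetical placeholders here).
    Returns (features, labels), one column and one "A:B" label per pair.
    """
    x = np.asarray(abundances, dtype=float)
    log_x = np.log(x)  # assumes no zeros or missing values
    pairs = list(combinations(range(x.shape[1]), 2))
    i_idx = np.array([i for i, _ in pairs])
    j_idx = np.array([j for _, j in pairs])
    features = log_x[:, i_idx] - log_x[:, j_idx]
    labels = [f"{names[i]}:{names[j]}" for i, j in pairs]
    return features, labels

# Toy data: 2 samples x 3 proteins (illustrative values only).
abund = np.array([[2.0, 4.0, 8.0],
                  [1.0, 1.0, 4.0]])
names = ["P1", "P2", "P3"]
feats, labels = pairwise_log_ratios(abund, names)

# Multiplying every protein in a sample by the same factor leaves the
# log-ratios unchanged, which is the compositional property at stake.
scaled_feats, _ = pairwise_log_ratios(abund * 10.0, names)
```

With p proteins this yields p(p − 1)/2 features, so a real panel would likely need feature selection or regularization downstream; the abstract does not specify which was used.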

To assess generality, we scaled the method to the UK Biobank proteomic dataset (n > 53,000; 587 phenotypes). The ratio-based model outperformed raw-level models in 95.1% of diseases, with statistically significant gains (FDR < 0.05) in 56.7%. Together, these results suggest that proteomic data should be viewed and modeled as a compositional system in which relative protein abundances carry the functionally relevant signal. This insight reframes how future proteomic studies, and potentially all omics studies, should represent molecular features for disease prediction and biomarker discovery.
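A per-phenotype summary of this kind could be computed along the following lines. The abstract does not state which FDR procedure or performance metric was used at this scale; the sketch assumes Benjamini-Hochberg adjustment and AUROC, and the function names and all numbers below are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q

def summarize_gains(auc_ratio, auc_raw, p_values, alpha=0.05):
    """Fraction of phenotypes improved, and fraction improved at FDR < alpha."""
    gain = np.asarray(auc_ratio, dtype=float) - np.asarray(auc_raw, dtype=float)
    q = benjamini_hochberg(p_values)
    frac_better = float(np.mean(gain > 0))
    frac_significant = float(np.mean((gain > 0) & (q < alpha)))
    return frac_better, frac_significant

# Hypothetical values for four phenotypes (not results from the study).
auc_ratio = [0.80, 0.70, 0.60, 0.90]
auc_raw = [0.70, 0.75, 0.55, 0.85]
p_vals = [0.01, 0.04, 0.03, 0.50]
q_vals = benjamini_hochberg(p_vals)
frac_better, frac_significant = summarize_gains(auc_ratio, auc_raw, p_vals)
```

The per-phenotype p-values would come from whatever paired comparison of the two models the study used, which the abstract leaves unspecified.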
