Prior knowledge informs graph neural networks to improve phenotype prediction from proteomics
Abstract
High-throughput proteomics data provides dense individual-level molecular readouts, enabling the development of machine learning models for predicting diverse phenotypes relevant to patient health. Proteins interact in the cell in complex, nonlinear relationships that may not be captured by linear models or simple machine learning approaches, highlighting the potential for more expressive deep neural networks to improve performance. Despite this potential, developing neural network approaches in biological domains has proven a significant challenge in practice. We developed a deep learning framework for predicting disease-related traits from protein expression data using an innovative model architecture designed to exploit structured biological knowledge. The core of the model is a graph neural network (GNN) operating on bipartite graphs in which one set of nodes represents protein expression levels and the other represents hundreds of protein sets derived from gene ontology libraries. Edges encode set membership, providing a compact and biologically meaningful structure. We trained our model using UK Biobank plasma proteomics and individual phenotype data. Of the architectures we examined, the best-performing one had three parallel heads: two GNNs, each using graphs constructed from an independent protein set library, and one global head that takes tabular protein expression data as input. Their outputs are concatenated and passed through a dense feed-forward network to predict the phenotype. When applied to predicting glycated hemoglobin (HbA1c) levels and a range of other phenotypes, our model showed strong predictive performance, outperforming other deep learning architectures and simpler linear models. Control models with permuted protein labels performed worse, demonstrating that the model benefits from the inductive bias introduced by incorporating prior knowledge, especially in settings with limited training data.
We present an innovative model architecture incorporating biological domain knowledge to predict complex traits from large-scale proteomic data.
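The three-head architecture described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the protein names, set libraries, layer sizes, mean aggregation, single message-passing step, and all function names (`membership_matrix`, `gnn_head`, `global_head`, `forward`, `permute_protein_labels`) are assumptions chosen for illustration, and the weights are random rather than trained. The sketch shows how set membership can be encoded as a bipartite adjacency matrix, how two graph heads and one tabular head might be concatenated before a dense network, and how a permuted-label control graph could be constructed.

```python
import numpy as np

def membership_matrix(proteins, protein_sets):
    """Bipartite adjacency: rows are protein sets, columns are proteins."""
    index = {p: j for j, p in enumerate(proteins)}
    A = np.zeros((len(protein_sets), len(proteins)))
    for i, members in enumerate(protein_sets.values()):
        for p in members:
            if p in index:  # skip set members that were not measured
                A[i, index[p]] = 1.0
    return A

def relu(z):
    return np.maximum(z, 0.0)

def gnn_head(x, A, params):
    """One protein-to-set message-passing step, then a linear readout."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)  # guard against empty sets
    A_norm = A / deg                                     # mean over set members
    set_state = relu((x * params["w_msg"]) @ A_norm.T)   # (batch, n_sets)
    return set_state @ params["w_out"]                   # (batch, hidden)

def global_head(x, params):
    """Tabular head: sees all protein levels without any graph structure."""
    return relu(x @ params["w"])

def init_params(n_proteins, n_sets_a, n_sets_b, hidden, rng):
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "gnn_a": {"w_msg": w(n_proteins), "w_out": w(n_sets_a, hidden)},
        "gnn_b": {"w_msg": w(n_proteins), "w_out": w(n_sets_b, hidden)},
        "global": {"w": w(n_proteins, hidden)},
        "dense": {"w1": w(3 * hidden, hidden), "w2": w(hidden)},
    }

def forward(x, A_a, A_b, params):
    """Concatenate the three heads and map them to one phenotype value per sample."""
    h = np.concatenate(
        [gnn_head(x, A_a, params["gnn_a"]),
         gnn_head(x, A_b, params["gnn_b"]),
         global_head(x, params["global"])],
        axis=1,
    )
    return relu(h @ params["dense"]["w1"]) @ params["dense"]["w2"]

def permute_protein_labels(A, rng):
    """Control graph: shuffle protein columns, preserving the size of every set."""
    return A[:, rng.permutation(A.shape[1])]

# Toy data (hypothetical proteins and set libraries, not from the paper).
rng = np.random.default_rng(0)
proteins = ["P1", "P2", "P3", "P4"]
library_a = {"set_a1": ["P1", "P2"], "set_a2": ["P3", "P4", "P9"]}
library_b = {"set_b1": ["P1", "P3"], "set_b2": ["P2"], "set_b3": ["P4"]}

A_a = membership_matrix(proteins, library_a)
A_b = membership_matrix(proteins, library_b)
params = init_params(len(proteins), len(library_a), len(library_b), hidden=8, rng=rng)

x = rng.normal(size=(5, len(proteins)))  # 5 individuals, 4 protein levels
y_hat = forward(x, A_a, A_b, params)
y_hat_control = forward(
    x, permute_protein_labels(A_a, rng), permute_protein_labels(A_b, rng), params
)
```

In this sketch the control only swaps in permuted graphs at inference time. A faithful comparison would train separate models on the original and permuted graphs and compare their held-out performance, which is how the abstract describes the control models.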