Prior knowledge informs graph neural networks to improve phenotype prediction from proteomics

Prabuddha Ghosh Dastidar
Gus Fridell
Joshua M. Popp
Marios Arvanitis
Alexis Battle

Abstract

High-throughput proteomics data provides dense individual-level molecular readouts, enabling the development of machine learning models for predicting diverse phenotypes relevant to patient health. Proteins interact in the cell through complex, nonlinear relationships that may not be captured by linear models or simple machine learning approaches, highlighting the potential for more expressive deep neural networks to improve performance. Despite this potential, developing neural network approaches in biological domains has proven challenging in practice. We developed a deep learning framework for predicting disease-related traits from protein expression data using an innovative model architecture designed to exploit structured biological knowledge. The core of the model is a graph neural network (GNN) operating on bipartite graphs in which one set of nodes represents protein expression levels and the other represents hundreds of protein sets derived from gene ontology libraries. Edges encode set membership, providing a compact and biologically meaningful structure. We trained our model on UK Biobank plasma proteomics and individual-level phenotype data. Of the architectures we examined, the best-performing one had three parallel heads: two GNNs, each using a graph constructed from an independent protein set library, and one global head operating on tabular protein expression data. The outputs of the three heads are concatenated and passed through a dense feed-forward network to predict the phenotype. When applied to predicting glycated hemoglobin (HbA1c) levels and a range of other phenotypes, our model showed strong predictive performance, outperforming other deep learning architectures and simpler linear models. Control models with permuted protein labels performed worse, demonstrating that the model benefits from the inductive bias introduced by incorporating prior knowledge, especially in settings with limited training data.
We present an innovative model architecture incorporating biological domain knowledge to predict complex traits from large-scale proteomic data.
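
The architecture described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify layer types, aggregation functions, hidden sizes, or training details, so the mean aggregation, the `tanh` nonlinearity, the randomly initialised weights, and every name below (`membership_matrix`, `gnn_head`, `forward`, `permute_labels`, the toy proteins `P1`..`P6`, and the two toy libraries) are assumptions made purely for illustration. The sketch shows a bipartite protein-to-set graph built from set membership, two graph heads plus one global tabular head whose outputs are concatenated, and a permuted-label control graph.

```python
import numpy as np

def membership_matrix(proteins, protein_sets):
    """Binary bipartite adjacency: rows are protein sets, columns are proteins."""
    index = {p: i for i, p in enumerate(proteins)}
    adj = np.zeros((len(protein_sets), len(proteins)))
    for row, members in enumerate(protein_sets.values()):
        for p in members:
            if p in index:  # ignore set members that were not measured
                adj[row, index[p]] = 1.0
    return adj

def gnn_head(x, adj, w_set, w_prot):
    """One protein -> set -> protein round of mean-aggregation message passing.

    x has shape (n_samples, n_proteins); the result has shape (n_samples, hidden).
    """
    set_deg = np.maximum(adj.sum(axis=1), 1.0)   # number of proteins in each set
    prot_deg = np.maximum(adj.sum(axis=0), 1.0)  # number of sets for each protein
    # Set nodes: mean expression of member proteins, scaled by a per-set weight.
    set_feat = np.tanh((x @ adj.T) / set_deg * w_set)
    # Protein nodes: mean of the messages from the sets they belong to.
    prot_feat = (set_feat @ adj) / prot_deg
    return np.tanh(prot_feat @ w_prot)

def forward(x, adj_a, adj_b, params):
    """Two graph heads and one global tabular head, concatenated, then a linear readout."""
    h_a = gnn_head(x, adj_a, params["set_a"], params["prot_a"])
    h_b = gnn_head(x, adj_b, params["set_b"], params["prot_b"])
    h_g = np.tanh(x @ params["global"])          # global head on raw expression
    h = np.concatenate([h_a, h_b, h_g], axis=1)
    return h @ params["out"]                     # one predicted value per sample

def permute_labels(adj, rng):
    """Control graph: shuffle protein columns so membership no longer matches the data."""
    return adj[:, rng.permutation(adj.shape[1])]

# Toy, hypothetical inputs (not from the paper).
proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
library_a = {"set_1": ["P1", "P2", "P3"], "set_2": ["P3", "P4"]}
library_b = {"set_x": ["P5", "P6"], "set_y": ["P1", "P6", "P9"], "set_z": ["P2"]}

adj_a = membership_matrix(proteins, library_a)
adj_b = membership_matrix(proteins, library_b)

rng = np.random.default_rng(0)
hidden = 4
params = {
    "set_a": rng.normal(size=adj_a.shape[0]),
    "prot_a": rng.normal(size=(len(proteins), hidden)),
    "set_b": rng.normal(size=adj_b.shape[0]),
    "prot_b": rng.normal(size=(len(proteins), hidden)),
    "global": rng.normal(size=(len(proteins), hidden)),
    "out": rng.normal(size=3 * hidden),
}

x = rng.normal(size=(5, len(proteins)))          # 5 simulated individuals
pred = forward(x, adj_a, adj_b, params)
pred_control = forward(x, permute_labels(adj_a, rng), permute_labels(adj_b, rng), params)
```

In this sketch the control graphs keep the same set sizes as the originals, so a control model has the same capacity and differs only in whether the edges reflect real set membership. Comparing trained models on the true and permuted graphs is one way to isolate the contribution of the prior knowledge, in the spirit of the control described in the abstract.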

Version published to 10.1101/2025.11.23.25340814 on medRxiv
Nov 25, 2025

