ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein–protein interactions and gene essentiality across taxa
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Language models trained on biological sequences are advancing inference tasks from the scale of single proteins to that of genomic neighborhoods. Here, we introduce ProteomeLM, a transformer-based language model that uniquely operates on entire proteomes from species spanning the tree of life. ProteomeLM is trained to reconstruct masked protein embeddings using the whole proteomic context, yielding contextualized protein representations that reflect proteome-scale functional constraints. Notably, ProteomeLM’s attention coefficients encode protein–protein interactions (PPI), despite being trained without interaction labels. Furthermore, it enables interactome-wide PPI screening that is substantially more accurate, and orders of magnitude faster, than amino acid coevolution-based methods. We further develop ProteomeLM-PPI, a supervised model that combines ProteomeLM embeddings and attention coefficients to achieve state-of-the-art PPI prediction across benchmarks and species. Finally, we introduce ProteomeLM-Ess, a supervised gene essentiality predictor that generalizes across diverse taxa. Our results demonstrate the potential of proteome-scale language models for addressing function and interactions at the organism level.
Article activity feed
-
Thus, our finding supports the notion that higher-order interactions, rather than simple local features, are essential for understanding PPI, and that ProteomeLM can extract such complex biological signals.
That is an intuitive explanation, but has this idea actually been tested? Specifically, is there evidence that later layers capture higher-order interactions (i.e. that the order of interaction "recovered/learned" by the attention is a function of layer depth)?
-
Thus, ProteomeLM can identify PPI among vast numbers of possible protein pairs in a complete proteome (e.g., ∼ 4,000 proteins in E. coli and ∼ 20,000 in humans, leading to ∼ 8 × 10⁶ and ∼ 2 × 10⁸ possible pairs, respectively) in an unsupervised manner, without any fine-tuning. This is especially compelling given that ProteomeLM does not rely on gene order or local genomic context. The learning of PPI arises directly from the masked prediction training, which promotes the learning of dependencies between proteins in a proteome.
Have you looked at metrics other than AUC that account for extreme class imbalance (e.g. AUC-PR)? PPIs (the "positive class" in this case) are exceedingly rare, a circumstance in which AUC can be misleadingly optimistic: the attention coefficients may successfully "reject" many of the true negatives while still being very poor at identifying actual PPIs.
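To make the concern concrete, here is a minimal sketch on purely synthetic scores, not ProteomeLM outputs. It first reproduces the pair counts quoted above, then draws scores for a rare positive class and computes both the ROC AUC and the average precision (a common summary of the precision-recall curve). The positive rate, score distributions, and the function names `roc_auc`, `average_precision` and `simulate` are all assumptions chosen for illustration.

```python
import random
from math import comb

def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.
    Assumes both classes are present and that scores have no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of the 1-based ranks (ascending score) of the positive examples.
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(scores, labels):
    """Mean of the precision measured at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / k
    return total / hits

def simulate(n_pairs=200_000, pos_rate=0.001, shift=1.5, seed=0):
    """Synthetic benchmark (all parameters are illustrative assumptions):
    negatives score ~ N(0, 1), positives score ~ N(shift, 1)."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < pos_rate else 0 for _ in range(n_pairs)]
    scores = [rng.gauss(shift if y else 0.0, 1.0) for y in labels]
    return roc_auc(scores, labels), average_precision(scores, labels)

# Unordered protein pairs for ~4,000 (E. coli) and ~20,000 (human) proteins.
n_pairs_ecoli = comb(4_000, 2)
n_pairs_human = comb(20_000, 2)

auc, ap = simulate()
print(f"E. coli pairs: {n_pairs_ecoli:,}; human pairs: {n_pairs_human:,}")
print(f"ROC AUC: {auc:.3f}; average precision: {ap:.3f}")
```

In this toy setting the ROC AUC looks respectable while the average precision stays low, because even a small false-positive rate applied to a very large negative class swamps the few true positives. Whether the same gap appears for the attention-based scores would have to be checked on the actual benchmarks.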