ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein–protein interactions and gene essentiality across taxa

Cyril Malbranke
Gionata Paolo Zalaffi
Anne-Florence Bitbol

This article has been Reviewed by the following groups

Listed in

Evaluated articles (Arcadia Science)

Abstract

Language models trained on biological sequences are advancing inference tasks from the scale of single proteins to that of genomic neighborhoods. Here, we introduce ProteomeLM, a transformer-based language model that uniquely operates on entire proteomes from species spanning the tree of life. ProteomeLM is trained to reconstruct masked protein embeddings using the whole proteomic context, yielding contextualized protein representations that reflect proteome-scale functional constraints. Notably, ProteomeLM’s attention coefficients encode protein–protein interactions (PPI), despite being trained without interaction labels. Furthermore, it enables interactome-wide PPI screening that is substantially more accurate, and orders of magnitude faster, than amino acid coevolution-based methods. We further develop ProteomeLM-PPI, a supervised model that combines ProteomeLM embeddings and attention coefficients to achieve state-of-the-art PPI prediction across benchmarks and species. Finally, we introduce ProteomeLM-Ess, a supervised gene essentiality predictor that generalizes across diverse taxa. Our results demonstrate the potential of proteome-scale language models for addressing function and interactions at the organism level.
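As a rough illustration of how attention coefficients could serve as unsupervised interaction scores, the toy sketch below ranks protein pairs from a single attention matrix. Everything in it is an assumption made for illustration: the four protein names, the attention values, the use of one matrix rather than many heads and layers, and the choice to symmetrize by averaging the two directions. It is not the authors' procedure.

```python
from itertools import combinations

# Hypothetical 4-protein attention matrix (row i attends to column j).
# All values are made up for illustration.
proteins = ["A", "B", "C", "D"]
attention = [
    [0.00, 0.70, 0.20, 0.10],
    [0.60, 0.00, 0.30, 0.10],
    [0.25, 0.25, 0.00, 0.50],
    [0.10, 0.20, 0.70, 0.00],
]


def rank_pairs(names, attn):
    """Score each unordered pair by the mean of its two directed attention values."""
    scored = []
    for i, j in combinations(range(len(names)), 2):
        score = (attn[i][j] + attn[j][i]) / 2  # assumed symmetrization
        scored.append((score, names[i], names[j]))
    # Highest-scoring (most strongly co-attending) pairs first.
    return sorted(scored, reverse=True)


ranked_pairs = rank_pairs(proteins, attention)
for score, a, b in ranked_pairs:
    print(f"{a}-{b}: {score:.3f}")
```

In this toy setting the pairs A-B and C-D come out on top because both directions of attention between them are large; a real screen would have to decide which heads and layers to use and how to combine them.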

Version published to 10.1073/pnas.2524201123
May 20, 2026
Arcadia Science
Mar 18, 2026

Thus, our finding supports the notion that higher-order interactions, rather than simple local features, are essential for understanding PPI, and that ProteomeLM can extract such complex biological signals.

That is an intuitive explanation, but has the underlying idea actually been tested, namely that later layers capture higher-order interactions (i.e., that the order of interaction "recovered/learned" by the attention is a function of layer depth)?

Arcadia Science
Mar 18, 2026

Thus, ProteomeLM can identify PPI among vast numbers of possible protein pairs in a complete proteome (e.g., ∼ 4,000 proteins in E. coli and ∼ 20,000 in humans, leading to ∼ 8 × 10⁶ and ∼ 2 × 10⁸ possible pairs, respectively) in an unsupervised manner, without any fine-tuning. This is especially compelling given that ProteomeLM does not rely on gene order or local genomic context. The learning of PPI arises directly from the masked prediction training, which promotes the learning of dependencies between proteins in a proteome.

Have you looked at metrics other than AUC that account for extreme class imbalance (e.g., AUC-PR)? PPIs (the "positive class" in this case) are exceedingly rare, a circumstance in which AUC can be misleadingly optimistic: the attention coefficients may successfully "reject" many of the true negatives while still being very poor at identifying actual PPIs.
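The concern above can be made concrete with a small synthetic example. The sketch below first counts the unordered protein pairs for the proteome sizes quoted in the passage, then builds a deliberately imbalanced toy dataset (10 positives among 10,000 negatives, with made-up scores) and compares ROC AUC with average precision, a common estimate of the area under the precision-recall curve. The data, the score values, and the helper functions are all assumptions for illustration and say nothing about ProteomeLM's actual performance; the ROC helper also assumes no tied scores.

```python
import math

# Number of unordered protein pairs for the proteome sizes quoted above.
n_pairs_ecoli = math.comb(4_000, 2)    # 7,998,000, about 8 x 10^6
n_pairs_human = math.comb(20_000, 2)   # 199,990,000, about 2 x 10^8


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Assumes no tied scores, to keep the sketch short.
    """
    negatives_seen = 0
    wins = 0
    for _, label in sorted(zip(scores, labels)):  # ascending by score
        if label == 1:
            wins += negatives_seen
        else:
            negatives_seen += 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return wins / (n_pos * n_neg)


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    hits = 0
    precision_sum = 0.0
    ranked = sorted(zip(scores, labels), reverse=True)  # descending by score
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# Synthetic data: negatives take scores 0..9999; the 10 positives are
# spread through the top 10% of the score range.
neg_scores = list(range(10_000))
pos_scores = [9000.5 + 100 * i for i in range(10)]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

auc_roc = roc_auc(labels, scores)
auc_pr = average_precision(labels, scores)
prevalence = sum(labels) / len(labels)

print(f"pairs: E. coli {n_pairs_ecoli:,}, human {n_pairs_human:,}")
print(f"ROC AUC = {auc_roc:.4f}, average precision = {auc_pr:.4f}, "
      f"prevalence = {prevalence:.4f}")
```

In this toy setting the ROC AUC is about 0.95 while the average precision stays near 0.01: roughly 99 of every 100 top-ranked pairs are false positives even though nearly all negatives are correctly ranked below the positives. Comparing average precision against the prevalence baseline is one way to report performance that reflects the rarity of true interactions.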

Version published to 10.1101/2025.08.01.668221 on bioRxiv
Aug 3, 2025

Related articles

Predicting host-pathogen interactions using a proteome-scale language model

This article has 3 authors:
1. Cyril Malbranke
2. Cecilia Fruet
3. Anne-Florence Bitbol
This article has no evaluations. Latest version: May 31, 2026

Task-Specialized Protein Language Models Decode the Sequence Grammar of Post-Translational Modification Sites

This article has 2 authors:
1. Subinoy Adhikari
2. Jagannath Mondal
This article has no evaluations. Latest version: May 12, 2026

Evolutionary constraints improve protein large language model predictions for protein stability, binding regions and epistasis

This article has 3 authors:
1. Konstantina Tzavella
2. Catharina Olsen
3. Wim Vranken
This article has no evaluations. Latest version: May 26, 2026
