Predicting host-pathogen interactions using a proteome-scale language model

Abstract

ProteomeLM (Malbranke et al., 2025) is a proteome-scale language model trained on proteomes spanning the tree of life to reconstruct masked protein embeddings from the proteome context within each species. Its attention coefficients capture protein-protein interactions without supervision. Here, we show that this capability extends to cross-species host-pathogen interactions (HPI) across ten human pathogen taxa spanning viruses and bacteria, and that it can be further improved with lightweight fine-tuning. We introduce ProteomeLM-HPI, a parameter-efficient adaptation via LoRA, trained on concatenated host-pathogen proteomes to reconstruct masked pathogen embeddings from host context. ProteomeLM-HPI involves two key design choices: asymmetric (pathogen-heavy) masking and blocked self-attention. Systematic ablations show that both choices contribute. To assess generalization, we introduce a strict cross-species benchmark enforcing pathogen-level hold-out and 40% sequence-identity filtering. On this benchmark, ProteomeLM-HPI improves AUC on 9 out of 10 unseen pathogens.
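
As an illustration only, two of the ideas named in the abstract, pathogen-heavy masking and pathogen-level hold-out, can be sketched in a few lines of Python. This is not the authors' implementation: the function names, the record layout, and the masking rates (`p_pathogen`, `p_host`) are hypothetical, since the abstract states only that masking is weighted toward pathogen proteins. Blocked self-attention and the 40% sequence-identity filter are not shown, because the abstract does not specify them in enough detail.

```python
import random


def asymmetric_mask(is_pathogen, p_pathogen=0.5, p_host=0.05, seed=0):
    """Choose which proteins to mask in a concatenated host+pathogen proteome.

    is_pathogen: list of bools, one per protein (True = pathogen protein).
    p_pathogen, p_host: hypothetical masking rates (assumed values); the only
    property taken from the abstract is that pathogen proteins are masked
    more often than host proteins.
    Returns a list of bools (True = this protein's embedding is masked).
    """
    rng = random.Random(seed)
    return [rng.random() < (p_pathogen if flag else p_host) for flag in is_pathogen]


def pathogen_holdout_split(pairs, held_out):
    """Split interaction records so held-out pathogens never appear in training.

    pairs: list of (host_protein, pathogen_protein, pathogen_taxon) tuples
           (a hypothetical record layout).
    held_out: set of pathogen taxa reserved for evaluation.
    """
    train = [p for p in pairs if p[2] not in held_out]
    test = [p for p in pairs if p[2] in held_out]
    return train, test
```

With `p_pathogen=1.0` and `p_host=0.0`, the mask reduces to the extreme case in which every pathogen protein is reconstructed from fully visible host context; intermediate rates trade off between the two sides.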
