Protein language model embeddings enable proteome-wide discovery of plant defense gene networks across species
Abstract
Identifying the full complement of defense genes across plant proteomes remains challenging, particularly for species with incomplete functional annotations. Here we present PlantDefenseESM, a computational pipeline that leverages protein language model embeddings to discover defense gene networks at the proteome scale without requiring species-specific training or curated gene ontology databases. We generated 1,280-dimensional embeddings for all proteins in the proteomes of Arabidopsis thaliana (48,207 proteins), Oryza sativa (42,575), and Vitis vinifera (40,632) using ESM-2, a transformer-based model pre-trained on 250 million protein sequences. Defense candidates were identified by cosine similarity to category centroids defined by 33 experimentally validated anchor proteins spanning six functional classes: NBS-LRR resistance proteins, pathogenesis-related proteins, receptor-like kinases, defense signaling components, antimicrobial enzymes, and hypersensitive response regulators. A multi-tier selection strategy combining percentile-based and rank-based approaches identified 2,807, 2,442, and 2,354 moderate-tier candidates in A. thaliana, O. sativa, and V. vinifera, respectively. Independent validation against RefSeq functional annotations confirmed 3.35–4.22-fold enrichment of defense-annotated proteins among candidates (Fisher's exact test, p < 10⁻¹⁹⁹ in all species). Notably, 55–59% of candidates across all three species lacked any existing defense annotation, representing putative novel defense genes. Cross-species comparison revealed a conserved category hierarchy with lineage-specific expansions consistent with known biology, including expanded cell death machinery in grapevine and receptor-like kinase families in rice. The pipeline is species-agnostic, requires only a reference proteome as input, and provides a scalable framework for defense gene discovery in any sequenced plant genome.
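As a rough illustration of the centroid-similarity step described above, the following minimal sketch scores each protein by cosine similarity to per-category centroids built from anchor embeddings, keeps a top fraction of the ranking, and computes a fold enrichment against an annotation set. It is not the PlantDefenseESM implementation: the function names, the toy 3-dimensional vectors (standing in for the 1,280-dimensional ESM-2 embeddings), the protein identifiers, and the single `fraction` cutoff are all assumptions made for illustration, and the multi-tier percentile and rank thresholds used in the study are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def score_proteins(embeddings, anchors_by_category):
    """Map each protein ID to (best-matching category, cosine similarity to its centroid)."""
    centroids = {
        category: centroid([embeddings[a] for a in anchor_ids])
        for category, anchor_ids in anchors_by_category.items()
    }
    scores = {}
    for protein_id, vector in embeddings.items():
        sims = {cat: cosine(vector, c) for cat, c in centroids.items()}
        best = max(sims, key=sims.get)
        scores[protein_id] = (best, sims[best])
    return scores

def select_top_fraction(scores, fraction):
    """Return the IDs in the top `fraction` of the similarity ranking (hypothetical cutoff)."""
    ranked = sorted(scores, key=lambda p: scores[p][1], reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:k]

def fold_enrichment(candidates, annotated, proteome_size):
    """Ratio of the annotated rate among candidates to the annotated rate proteome-wide."""
    hits = len(set(candidates) & set(annotated))
    return (hits / len(candidates)) / (len(annotated) / proteome_size)

# Toy data: made-up identifiers and 3-dimensional vectors, for illustration only.
embeddings = {
    "anchor_nlr_1": [1.0, 0.0, 0.0],
    "anchor_nlr_2": [0.9, 0.1, 0.0],
    "anchor_pr_1": [0.0, 1.0, 0.0],
    "prot_a": [0.8, 0.2, 0.0],
    "prot_b": [0.1, 0.9, 0.1],
    "prot_c": [0.0, 0.0, 1.0],
}
anchors_by_category = {
    "NBS-LRR": ["anchor_nlr_1", "anchor_nlr_2"],
    "PR": ["anchor_pr_1"],
}

scores = score_proteins(embeddings, anchors_by_category)
candidates = select_top_fraction(scores, fraction=0.5)
```

With real data, `embeddings` would hold one pooled ESM-2 vector per protein, and a statistical test such as Fisher's exact test would accompany the fold enrichment; both are left out here to keep the sketch dependency-free.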