ProCyon: A multimodal foundation model for protein phenotypes
Abstract
Characterizing human proteins remains a major challenge: approximately 29% of human proteins lack experimentally validated functions and even well-annotated proteins often lack context-specific phenotypic insights. To enable universal modeling of protein phenotypes, we present ProCyon, a multimodal foundation model that utilizes protein sequence, structure, and natural language for generating and predicting protein phenotypes across diverse knowledge domains. ProCyon is trained on our novel dataset, ProCyon-Instruct, with 33 million protein phenotype instructions. On dozens of benchmarking tasks, ProCyon performs competitively against single-modal and multimodal models. Further, ProCyon conditionally retrieves proteins via mechanisms of action of small molecule drugs and disease contexts, and it generates candidate phenotypic descriptions for poorly characterized proteins, including those implicated in Parkinson’s disease that were identified after ProCyon’s knowledge cutoff date. We experimentally confirm ProCyon’s predictions in multiple sclerosis using post-mortem brain RNA-seq, identifying novel MS genes and elucidating associated pathway mechanisms consistent with cortical pathology. ProCyon paves the way toward a general approach to generate functional insights into the human proteome.