ProCyon: A multimodal foundation model for protein phenotypes
Abstract
Understanding the roles of human proteins remains a major challenge, with approximately 20% of human proteins lacking known functions and more than 40% missing context-specific functional insights. Even well-annotated proteins are often poorly characterized in diverse biological contexts, disease states, and perturbations. We present ProCyon, a foundation model for modeling, generating, and predicting protein phenotypes across five interrelated knowledge domains: molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions. To support this, we created ProCyon-INSTRUCT, a dataset of 33 million protein phenotype instructions, representing a comprehensive resource for multiscale protein phenotypes. By co-training a large language model with multimodal molecular encoders, ProCyon integrates phenotypic and protein data. A novel architecture and instruction tuning strategy allow ProCyon to process arbitrarily interleaved protein-and-phenotype inputs, achieve zero-shot task transfer, and generate free-form text phenotypes interleaved with retrieved protein sequence, structure, and drug modalities in a single unified model. ProCyon achieves strong performance against single-modality models, multimodal models such as ESM3, as well as text-only LLMs on dozens of benchmarking tasks such as contextual protein retrieval and question answering. We extensively evaluate ProCyon for biological applications, including identifying protein domains that bind small molecule drugs, predicting peptide binding with enzymes, and assessing the functional impact of Alzheimer’s disease mutations. ProCyon enables conditional retrieval of proteins linked to small molecules through complementary mechanisms of action.
It generates candidate phenotypes for under-characterized proteins recently implicated in Parkinson’s disease, facilitating hypothesis generation for poorly understood proteins and biological processes. ProCyon paves the way toward an effective, general solution for functional