Illuminating the Druggable Proteome with an AI Protein Profiling Platform
Abstract
Creating a ligandable atlas for the proteome would transform our understanding of protein functions and accelerate therapeutic discovery; however, proteomic approaches are constrained by insufficient proteome coverage and data heterogeneity, while existing machine learning (ML) models have limited power due to structural dependencies and heterogeneous experimental labels. Here we developed AiPP, a multimodal AI platform that predicts and characterizes ligand interaction sites directly from protein sequence. AiPP is powered by evolutionary-scale protein large language models (LLMs) and leverages two harmonized ML training sets derived from new databases comprising cysteine ligandability data from activity-based protein profiling (ABPP) studies and reversible binding evidence from co-crystal structures. We developed an LLM representation-based clustering framework to interrogate, reconcile, and augment experimental labels in both databases. Two complementary protocols were implemented to iteratively expand the training data while improving model performance. Although trained exclusively on ABPP data, AiPP recovers 80% (Top-1) of cysteine liganding events from co-crystal structures, with 84% AUPRC and 89% AUROC. AiPP recapitulates both consistently and heterogeneously liganded cysteines across cancer cell lines and reliably identifies dynamic, ligandable pockets in "undruggable" transcription factors. Remarkably, AiPP accurately predicts active-site and allosteric cysteines in protein tyrosine phosphatases that were undetected by ABPP. Finally, we applied AiPP to the entire human proteome, identifying ligandable sites in proteins that were undetected or unliganded by ABPP, including an allosteric site in MC3R, a therapeutic target for the treatment of eating disorders and obesity.
This proteome-wide covalent ligandability atlas (version 1.0) is anticipated to guide future development of chemical probes and pharmaceutical modulators, particularly for understudied proteins and currently undruggable targets. The LLM-based approach for interrogating large-scale heterogeneous data is broadly applicable to protein research and to the development of proteomics-derived ML models for diverse applications.
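For readers unfamiliar with the metrics quoted above (Top-1 recovery, AUPRC, AUROC), the following is a minimal sketch of how such numbers could be computed from per-cysteine scores. It is not the authors' evaluation code: the record layout, the toy data, and the function names are hypothetical, and the Top-1 definition used here (a protein counts as recovered when its highest-scoring cysteine is a true liganded site) is one plausible reading that the abstract does not spell out.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (a common AUPRC estimate)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    return total / n_pos


def top1_recovery(records):
    """Fraction of proteins whose top-scoring cysteine is a true liganded site.

    records: iterable of (protein_id, residue_position, label, score).
    Proteins without any positive label are skipped (an assumption of this sketch).
    """
    by_protein = defaultdict(list)
    for protein_id, _position, label, score in records:
        by_protein[protein_id].append((score, label))
    evaluated = [v for v in by_protein.values() if any(y == 1 for _, y in v)]
    if not evaluated:
        raise ValueError("no protein with a positive label")
    hits = sum(1 for v in evaluated if max(v, key=lambda t: t[0])[1] == 1)
    return hits / len(evaluated)


# Hypothetical toy data: (protein, cysteine position, liganded?, predicted score)
records = [
    ("P1", 10, 1, 0.9), ("P1", 55, 0, 0.2), ("P1", 80, 0, 0.4),
    ("P2", 12, 0, 0.7), ("P2", 33, 1, 0.6),
    ("P3", 5, 1, 0.8), ("P3", 9, 0, 0.1),
]
labels = [r[2] for r in records]
scores = [r[3] for r in records]

top1 = top1_recovery(records)
roc = auroc(labels, scores)
ap = average_precision(labels, scores)
print(f"Top-1 = {top1:.3f}, AUROC = {roc:.3f}, AUPRC = {ap:.3f}")
```

Top-1 is computed per protein while AUROC and AUPRC pool all cysteines, so the three figures capture different aspects of performance and need not move together.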