Illuminating the Druggable Proteome with an AI Protein Profiling Platform

Jana Shen
Guy Dayhoff II
Daniel Kortzak
Ruibin Liu
Mingzhe Shen
Zhong-Yin Zhang


Abstract

Creating a ligandability atlas for the proteome would transform our understanding of protein functions and accelerate therapeutic discovery; however, proteomic approaches are constrained by insufficient proteome coverage and data heterogeneity, while existing machine learning (ML) models have limited power due to structural dependencies and heterogeneous experimental labels. Here we developed AiPP, a multimodal AI platform that predicts and characterizes ligand interaction sites directly from protein sequence. AiPP is powered by evolutionary-scale protein large language models (LLMs) and leverages two harmonized ML training sets derived from new databases comprising cysteine ligandability from activity-based protein profiling (ABPP) studies and reversible binding evidenced by co-crystal structures. We developed an LLM representation-based clustering framework to interrogate, reconcile, and augment experimental labels in both databases. Two complementary protocols were implemented to iteratively expand the training data while improving model performance. Although trained exclusively on ABPP data, AiPP recovers 80% (Top-1) of cysteine liganding events from co-crystal structures, with 84% AUPRC and 89% AUROC. AiPP recapitulates consistently and heterogeneously liganded cysteines across cancer cell lines and reliably identifies dynamic, ligandable pockets in "undruggable" transcription factors. Remarkably, AiPP accurately predicts active-site and allosteric cysteines in protein tyrosine phosphatases that were undetected by ABPP. Finally, we applied AiPP to the entire human proteome, identifying ligandable sites in proteins that were undetected or unliganded by ABPP, including an allosteric site in MC3R, a therapeutic target for the treatment of eating disorders and obesity.
This proteome-wide covalent ligandability atlas (version 1.0) is anticipated to guide future development of chemical probes and pharmaceutical modulators, particularly for understudied proteins and currently undruggable targets. The LLM-based approach to interrogating large-scale heterogeneous data is broadly applicable to protein research and to the development of proteomics-derived ML models for diverse applications.

Version published to 10.21203/rs.3.rs-7667948/v1 on Research Square
Oct 3, 2025

Related articles

Integrating Evolutionary and Compositional Features with ML and DL for Robust and Interpretable Druggable Protein Prediction

This article has 5 authors:
1. Mujeebu Rehman
2. Qinghua Liu
3. Muhammad Javed
4. Ali Ghulam
5. Teerath Kumar
This article has no evaluations. Latest version: Dec 11, 2025

The Deep Core: Mapping the 0.91% Regulatory Backbone of the Human Proteome and Its Role in Cancer Drug Resistance

This article has 1 author:
1. Andres Pirolo
This article has no evaluations. Latest version: Feb 4, 2026

Artificial Intelligence–Driven Structural Mining Enables Functional Inference in the Human Dark Proteome

This article has 7 authors:
1. Valentina Carbonari
2. Annamaria Defilippo
3. Ugo Lomoio
4. Caterina Francesca Perri
5. Barbara Puccio
6. Pierangelo Veltri
7. Pietro Hiram Guzzi
This article has no evaluations. Latest version: Dec 23, 2025
