Zero-Shot Protein-Ligand Binding Site Prediction from Protein Sequence and SMILES

Mahdi Pourmirzaei
Salhuldin Alqarghuli
Kai Chen
Mohammadreza Pourmirzaei
Dong Xu

Abstract

Accurate identification of protein-ligand binding sites is critical for mechanistic biology and drug discovery, yet performance varies widely across ligand families and data regimes. We present a systematic prediction and evaluation framework that stratifies ligands into three settings: overrepresented (many examples), underrepresented (tens of examples; few-shot), and zero-shot (unseen at training). We develop a novel three-stage, sequence-based modeling suite that progressively adds ligand conditioning and zero-shot capability, and we assess it with this evaluation framework. Stage 1 trains per-ligand predictors using a pretrained protein language model (PLM). Stage 2 introduces ligand-aware conditioning via an embedding table, enabling a single multi-ligand model. Stage 3 replaces the table with a pretrained chemical language model (CLM) operating on SMILES, enabling zero-shot generalization. Stage 2 improves Macro F1 on the overrepresented test set from 0.4769 (Stage 1) to 0.5832 and outperforms sequence- and structure-based baselines. Stage 3 attains a zero-shot F1 of 0.3109 on 5,612 previously unseen ligands while remaining competitive on represented ligands. Ablations across five PLM scales and multiple CLMs show that larger PLM backbones consistently increase Macro F1 across all regimes, whereas scaling the CLM yields modest or inconsistent gains that need further investigation. Our results demonstrate that zero-shot residue-level prediction from sequence and SMILES is feasible and identify PLM scale as the dominant lever for further advances. The code is open source on GitHub: https://github.com/mahdip72/ProteinLigand
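
To make the reported metric concrete, the following is a minimal sketch of how a Macro F1 over residue-level binding-site labels could be computed. It assumes that Macro F1 means the unweighted mean of per-ligand F1 scores and that each residue carries a binary label (1 = binding, 0 = non-binding); the abstract does not spell out the averaging scheme, so this is an illustration rather than the authors' evaluation code. The ligand names and label vectors are invented toy data, and the helper names `f1_score` and `macro_f1` are hypothetical.

```python
def f1_score(true_labels, pred_labels):
    """F1 for one ligand over binary per-residue labels (1 = binding residue)."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_ligand):
    """Unweighted mean of per-ligand F1 (assumed averaging scheme)."""
    scores = [f1_score(true, pred) for true, pred in per_ligand.values()]
    return sum(scores) / len(scores)


# Toy data (invented for illustration): ligand -> (true labels, predicted labels)
toy = {
    "ATP": ([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]),
    "ZN": ([1, 0, 0, 1], [1, 1, 0, 0]),
}
toy_macro = macro_f1(toy)

# Absolute Macro F1 gain of Stage 2 over Stage 1, using the figures quoted in the abstract.
stage2_gain = 0.5832 - 0.4769
```

Under this averaging, every ligand contributes equally regardless of how many residues or examples it has, which is one reason a macro average is a natural fit for comparing overrepresented and underrepresented ligand regimes.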

Version published to 10.1101/2025.09.28.679103 on bioRxiv
Sep 30, 2025

Related articles

Illuminating the Druggable Human Proteome with an AI Protein Profiling Platform

This article has 6 authors:
1. Guy W. Dayhoff
2. Daniel Kortzak
3. Ruibin Liu
4. Mingzhe Shen
5. Zhong-Yin Zhang
6. Jana Shen
This article has no evaluations. Latest version: Sep 8, 2025

BindPred: A Framework for Predicting Protein-Protein Binding Affinity from Language Model Embeddings

This article has 4 authors:
1. Haixing Piao
2. Veda Sheersh Boorla
3. Somtirtha Santra
4. Costas D. Maranas
This article has no evaluations. Latest version: Sep 29, 2025

GatorAffinity: Boosting Protein-Ligand Binding Affinity Prediction with Large-Scale Synthetic Structural Data

This article has 8 authors:
1. Jinhang Wei
2. Yupu Zhang
3. Peter A Ramdhan
4. Zihang Huang
5. Gustavo Seabra
6. Zhe Jiang
7. Chenglong Li
8. Yanjun Li
This article has no evaluations. Latest version: Oct 1, 2025
