Zero-Shot Protein-Ligand Binding Site Prediction from Protein Sequence and SMILES
Abstract
Accurate identification of protein-ligand binding sites is critical for mechanistic biology and drug discovery, yet performance varies widely across ligand families and data regimes. We present a systematic prediction and evaluation framework that stratifies ligands into three settings: overrepresented (many examples), underrepresented (tens of examples; few-shot), and zero-shot (unseen at training). We develop a novel three-stage, sequence-based modeling suite that progressively adds ligand conditioning and zero-shot capability, and assess it with this evaluation framework. Stage 1 trains per-ligand predictors using a pretrained protein language model (PLM). Stage 2 introduces ligand-aware conditioning via an embedding table, enabling a single multi-ligand model. Stage 3 replaces the table with a pretrained chemical language model (CLM) operating on SMILES, enabling zero-shot generalization. Stage 2 improves Macro F1 on the overrepresented test set from 0.4769 (Stage 1) to 0.5832 and outperforms sequence- and structure-based baselines. Stage 3 attains a zero-shot F1 of 0.3109 on 5,612 previously unseen ligands while remaining competitive on represented ligands. Ablations across five PLM scales and multiple CLMs show that larger PLM backbones consistently increase Macro F1 across all regimes, whereas scaling the CLM yields modest or inconsistent gains that need further investigation. Our results demonstrate that zero-shot residue-level prediction from sequence and SMILES is feasible and identify PLM scale as the dominant lever for further advances. The code is open source on GitHub: https://github.com/mahdip72/ProteinLigand
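To make the evaluation setup concrete, the following is a minimal sketch of how ligands might be assigned to the three regimes and how a Macro F1 over residue-level predictions could be computed. It is not taken from the authors' repository. The function names, the threshold of 100 training examples, and the reading of "Macro F1" as an unweighted mean of per-ligand F1 scores for the binding class are all assumptions made for illustration; the abstract states only "many examples" and "tens of examples" and does not define the averaging.

```python
def regime(ligand, train_counts, many=100):
    """Assign a ligand to a data regime from its training-set count.

    `many` is a hypothetical cutoff; the abstract gives no exact threshold.
    """
    n = train_counts.get(ligand, 0)
    if n == 0:
        return "zero-shot"          # never seen during training
    if n >= many:
        return "overrepresented"    # many training examples
    return "underrepresented"       # few-shot: some examples, below cutoff


def f1_binary(y_true, y_pred):
    """F1 for the positive (binding) class over per-residue 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_ligand):
    """Unweighted mean of per-ligand F1 (assumed meaning of Macro F1).

    per_ligand maps a ligand ID to a (y_true, y_pred) pair of 0/1 lists.
    """
    scores = [f1_binary(t, p) for t, p in per_ligand.values()]
    return sum(scores) / len(scores)


# Toy data, invented for illustration only.
train_counts = {"ZN": 5000, "ATP": 40}
predictions = {
    "ZN": ([1, 0, 0, 1], [1, 0, 1, 1]),   # F1 = 0.8
    "ATP": ([0, 1, 1, 0], [0, 1, 0, 0]),  # F1 = 2/3
}

for lig in ("ZN", "ATP", "NEW"):
    print(lig, regime(lig, train_counts))
print(round(macro_f1(predictions), 4))
```

Averaging per ligand rather than pooling all residues keeps frequent ligands from dominating the score, which matters when comparing the overrepresented and underrepresented regimes.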