Using Autoregressive-Transformer Model for Protein-Ligand Binding Site Prediction
Abstract
Accurate prediction of protein-ligand binding sites is critical for understanding molecular interactions and advancing drug discovery. Existing computational approaches often suffer from limited generality, restricting their applicability to a small subset of ligands, while data scarcity further impairs performance, particularly for underrepresented ligand types. To address these challenges, we introduce a unified model that integrates a protein language model with an autoregressive transformer for protein-ligand binding site prediction. By framing the task as a language modeling problem and incorporating task-specific tokens, our method achieves broad ligand coverage while relying solely on protein sequence input. We systematically analyze ligand-specific task token embeddings, demonstrating that they capture meaningful biochemical properties through clustering and correlation analyses. Furthermore, our multi-task learning strategy enables effective knowledge transfer across ligands, significantly improving predictions for those with limited training data. Experimental evaluations on 41 ligands highlight the model’s superior generalization and applicability compared to existing methods. This work establishes a scalable generative AI framework for binding site prediction, laying the foundation for future extensions incorporating structural information and richer ligand representations. The code, model, and datasets are available at this link.
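The core idea of the abstract — casting binding site prediction as language modeling over a protein sequence prefixed with a ligand-specific task token — can be sketched as follows. This is a minimal illustrative mock-up, not the paper's actual model or API: the vocabulary, `encode`, `predict_binding_sites`, and the toy scorer are all hypothetical, and the scorer stands in for the protein language model plus autoregressive transformer.

```python
# Hypothetical sketch of the "task token + sequence -> per-residue labels"
# framing. Names and token ids are illustrative, not from the paper.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
# Ligand-specific task tokens appended after the amino-acid vocabulary
# (example ligand types only).
TASK_TOKENS = {"ZN": 20, "ATP": 21, "HEME": 22}

def encode(sequence: str, ligand: str) -> list[int]:
    """Prepend the ligand's task token to the tokenized protein sequence."""
    return [TASK_TOKENS[ligand]] + [AA_TO_ID[aa] for aa in sequence]

def predict_binding_sites(tokens: list[int], score) -> list[int]:
    """Autoregressively emit one binding (1) / non-binding (0) label
    per residue, each prediction conditioned on the full input and the
    labels emitted so far."""
    labels = []
    for pos in range(1, len(tokens)):  # position 0 is the task token
        p = score(tokens, labels, pos)
        labels.append(1 if p >= 0.5 else 0)
    return labels

# Toy stand-in for the transformer: flags His/Cys residues as
# zinc-binding, ignoring context. A real model would use the prefix.
def toy_score(tokens, labels, pos):
    return 1.0 if tokens[pos] in (AA_TO_ID["H"], AA_TO_ID["C"]) else 0.0

tokens = encode("MAHCG", "ZN")
print(predict_binding_sites(tokens, toy_score))  # -> [0, 0, 1, 1, 0]
```

Because only the task token changes between ligands, a single shared model can cover all 41 ligand types, which is what enables the cross-ligand knowledge transfer described above.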