BindPred: A Framework for Predicting Protein-Protein Binding Affinity from Language Model Embeddings
Abstract
Motivation
Reliable predictions of protein–protein binding affinities are essential for molecular biology and therapeutic discovery. However, most computational methods rely on three-dimensional structural models, which are unavailable for many complexes.
Results
We introduce BindPred, a structure-agnostic framework that predicts binding affinities directly from amino acid sequences by combining embeddings from large protein language models with gradient-boosted trees. On the PPB-Affinity benchmark of 11,919 diverse complexes, BindPred achieves a Pearson correlation coefficient of 0.86 under random-split five-fold cross-validation in which the training and test sets share <30% global sequence identity. Ablation analysis indicates that the evolutionary embeddings alone capture most of the predictive signal; augmenting them with physics-based energy terms from PyRosetta and BindCraft raises the correlation by only 0.01. A more stringent protein-level split, which places entire protein families (wild type and all mutants) exclusively in either the training or the test set, produces only a modest decline in performance, demonstrating robust generalization to novel interaction pairs. Because BindPred operates exclusively on sequence input, inference is rapid (approximately 3 million complexes per T4 GPU-hour), making proteome-scale screening computationally feasible.
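To make the sequence-only pipeline concrete, the sketch below illustrates the kind of workflow the abstract describes: per-chain embeddings from a pretrained protein language model are mean-pooled, concatenated, and passed to a gradient-boosted tree regressor. The specific checkpoint (facebook/esm2_t12_35M_UR50D), the pooling strategy, and scikit-learn's GradientBoostingRegressor are illustrative assumptions, not the released BindPred implementation.

```python
# Minimal sketch of a sequence-only affinity pipeline (illustrative; not the
# released BindPred code). Assumptions: ESM-2 embeddings, mean pooling, and
# scikit-learn gradient-boosted trees standing in for the paper's GBT model.
import numpy as np
import torch
from sklearn.ensemble import GradientBoostingRegressor
from transformers import AutoTokenizer, EsmModel

MODEL = "facebook/esm2_t12_35M_UR50D"  # small ESM-2 checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
plm = EsmModel.from_pretrained(MODEL).eval()

def embed(seq: str) -> np.ndarray:
    """Mean-pool the final-layer residue embeddings into one fixed-size vector."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = plm(**inputs).last_hidden_state  # (1, L+2, d) incl. CLS/EOS
    return hidden[0, 1:-1].mean(dim=0).numpy()   # drop special tokens, pool

def featurize(seq_a: str, seq_b: str) -> np.ndarray:
    """Represent a complex as the concatenation of its two chain embeddings."""
    return np.concatenate([embed(seq_a), embed(seq_b)])

# Toy stand-in data; real training would use the PPB-Affinity benchmark.
pairs = [("MKTAYIAKQRQISFVK", "GSHMASMTGGQQMGR"),
         ("MENSDSENQNHLGGE", "MAAHKGAEHHHKAAE")]
y = np.array([-9.2, -7.5])  # e.g., binding free energies or log Kd values

X = np.stack([featurize(a, b) for a, b in pairs])
reg = GradientBoostingRegressor().fit(X, y)
print(reg.predict(X))
```

Under this design, inference cost is dominated by the language-model forward passes, consistent with the throughput figure above once embeddings are batched on a GPU. The stricter protein-level evaluation would assign all variants of a protein to the same fold (e.g., via scikit-learn's GroupKFold) rather than splitting complexes at random.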
Availability
The pretrained model and inference pipeline are available as a Google Colab notebook: BindPred Colab notebook. The training dataset, code, and model weights are available on Hugging Face: BindPred
Contact
costas@psu.edu
Supplementary information
Supplementary data are available at Bioinformatics online.