BindPred: A Framework for Predicting Protein-Protein Binding Affinity from Language Model Embeddings
Abstract
Motivation
Reliable predictions of protein–protein binding affinities are essential for molecular biology and therapeutic discovery. However, most computational methods rely on three-dimensional structural models, which are unavailable for many complexes.
Results
We introduce BindPred, a structure-agnostic framework that predicts binding affinities directly from amino acid sequences by combining embeddings from large protein language models with gradient-boosted trees. On the PPB-Affinity benchmark of 11,919 diverse complexes, BindPred achieves a Pearson correlation coefficient of 0.86 under random-split five-fold cross-validation in which the training and test sets share <30% global sequence identity. Ablation analysis indicates that the evolutionary embeddings alone capture most of the predictive signal; augmenting them with physics-based energy terms from PyRosetta and BindCraft raises the correlation by only 0.01. A more stringent protein-level split, which places entire protein families (wild type and all mutants) exclusively in either the training or the test set, produces only a modest decline in performance, demonstrating robust generalization to novel interaction pairs. Because BindPred operates exclusively on sequence input, inference is rapid (approximately 3 million complexes per T4 GPU-hour), making proteome-scale screening computationally feasible.
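To make the sequence-only pipeline concrete, the sketch below illustrates the kind of workflow the abstract describes: per-chain embeddings from a pretrained protein language model are mean-pooled, concatenated, and passed to a gradient-boosted tree regressor. The specific checkpoint (facebook/esm2_t12_35M_UR50D), the pooling strategy, and scikit-learn's GradientBoostingRegressor are illustrative assumptions, not the released BindPred implementation.

```python
# Minimal sketch of a sequence-only affinity pipeline (illustrative; not the
# released BindPred code). Assumptions: ESM-2 embeddings, mean pooling, and
# scikit-learn gradient-boosted trees standing in for the paper's GBT model.
import numpy as np
import torch
from sklearn.ensemble import GradientBoostingRegressor
from transformers import AutoTokenizer, EsmModel

MODEL = "facebook/esm2_t12_35M_UR50D"  # small ESM-2 checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
plm = EsmModel.from_pretrained(MODEL).eval()

def embed(seq: str) -> np.ndarray:
    """Mean-pool the final-layer residue embeddings into one fixed-size vector."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = plm(**inputs).last_hidden_state  # (1, L+2, d) incl. CLS/EOS
    return hidden[0, 1:-1].mean(dim=0).numpy()   # drop special tokens, pool

def featurize(seq_a: str, seq_b: str) -> np.ndarray:
    """Represent a complex as the concatenation of its two chain embeddings."""
    return np.concatenate([embed(seq_a), embed(seq_b)])

# Toy stand-in data; real training would use the PPB-Affinity benchmark.
pairs = [("MKTAYIAKQRQISFVK", "GSHMASMTGGQQMGR"),
         ("MENSDSENQNHLGGE", "MAAHKGAEHHHKAAE")]
y = np.array([-9.2, -7.5])  # e.g., binding free energies or log Kd values

X = np.stack([featurize(a, b) for a, b in pairs])
reg = GradientBoostingRegressor().fit(X, y)
print(reg.predict(X))
```

Under this design, inference cost is dominated by the language-model forward passes, consistent with the throughput figure above once embeddings are batched on a GPU. The stricter protein-level evaluation would assign all variants of a protein to the same fold (e.g., via scikit-learn's GroupKFold) rather than splitting complexes at random.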
Availability
The pretrained model and inference pipeline are available as a Google Colab notebook: BindPred Colab notebook. The training dataset, code, and model weights are available on Hugging Face: BindPred
Contact
costas@psu.edu
Supplementary information
Supplementary data are available at Bioinformatics online.