Have protein-ligand co-folding methods moved beyond memorisation?


Abstract

Deep learning has driven major breakthroughs in protein structure prediction; however, the next critical advance is accurately predicting how proteins interact with other molecules, especially small-molecule ligands, to enable real-world applications such as drug discovery and design. Recent deep learning all-atom methods have been built to address this challenge, but evaluating their performance on the prediction of protein-ligand complexes has been inconclusive due to the lack of relevant benchmarking datasets. Here we present a comprehensive evaluation of four leading all-atom co-folding deep learning methods using our newly introduced benchmark dataset Runs N’ Poses, which comprises 2,600 high-resolution protein-ligand systems released after the training cutoff used by these methods. We demonstrate that current co-folding approaches largely memorise ligand poses from their training data, hindering their use for de novo drug design. This limitation is especially pronounced for ligands that have only been seen binding in one pocket, whereas more promiscuous ligands such as cofactors show moderately improved performance. With this work and benchmark dataset, we aim to accelerate progress in the field by allowing for a more realistic assessment of the current state-of-the-art deep learning methods for predicting protein-ligand interactions.

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/15708197.

    Have protein-ligand co-folding methods moved beyond memorisation?

    https://www.biorxiv.org/content/10.1101/2025.02.03.636309v2 

    Peter Škrinjar, Jérôme Eberhardt, Janani Durairaj, Torsten Schwede

    Summary

The authors present the Runs N' Poses dataset, containing 2,600 high-resolution protein-ligand complexes from the PDB for use as a benchmark for protein-ligand co-folding tools. They categorize the complexes in this dataset by their similarity to ligands and binding pockets in the AlphaFold3 training data. Evaluating AlphaFold3 and the similar models Protenix, Chai-1, and Boltz-1 on the dataset reveals that model accuracy declines significantly on complexes dissimilar to the training data. They find that this accuracy decrease occurs even for complexes with small differences in ligand positioning compared to the training set, and that the protein binding pocket is generally modelled correctly even when the complex is modelled incorrectly.

    The major success of the paper is compiling this benchmark dataset, then applying it to evaluate the accuracy of AlphaFold3 on out-of-distribution protein-ligand complexes.

A major weakness of the paper is its limited discussion of how or why AlphaFold3 mispredicts out-of-distribution complexes. The authors show that the binding pocket is predicted accurately even in complexes dissimilar to the training set, but do not go on to examine the errors in ligand positioning.

    The Runs N' Poses benchmark presents an important advancement for protein dataset benchmarking. The discovery of AlphaFold3's memorization of protein-ligand complexes is an important message, particularly for researchers attempting to apply the tool for de novo protein design.

    Major points

    • We think it would be useful for the benchmarking (and possible training) of future models to report the following information about the dataset:

      • What is the most recent date for PDB structures to be included in the Runs N' Poses benchmark?

      • Were proteins in the AlphaFold3 validation set excluded from the Runs N' Poses benchmark?

        • The AF3 validation set included structures released "after 2021-09-30 and before 2023-01-13" [1] and was used for model selection, so some leakage could exist between the model that was selected and the structures tested in this benchmark. AF3 does not appear to exclude its validation set from its own evaluation set, and fully mitigating this leakage is challenging given the scarcity of data. We believe it would still be useful to state whether this was considered when constructing the benchmark, and perhaps to provide a subset of structures that are not present in the AF3 validation set.

      • How many proteins were excluded at each step of processing/filtering the dataset?

      • What are the reasons for choosing an R-factor and R-free cutoff of 0.4?

        • We think this cutoff is quite high and, considering the variable quality of ligand depositions in the PDB, could affect evaluations that use this benchmark.

        • It also seems possible to re-refine all these deposited structures with an established refinement pipeline (e.g. Phenix) to obtain higher quality baseline structures.

    • While it is a useful result to quantitatively evaluate the accuracy on low-similarity complexes, we wonder what qualitative factors lead to inaccurate predictions. Figures 1D and 2B indicate that the decrease in accuracy is not significantly caused by mispredictions in the protein part of the binding site, and we think an explicit examination of ligand positioning (e.g. simple rotations and translations) or conformational differences contributing to the decrease would point to concrete vectors of improvement for modelling (a minimal sketch of such a decomposition follows after this list).

    • We are especially interested in how the accuracy of protein-ligand co-folding depends on the molecular characteristics of the ligand. The authors tangentially address this by showing that accuracy is largely invariant across ligand sizes, but we think a deeper look at accuracy on particular types of ligands (e.g. largely hydrophobic ones) or particular functional groups (e.g. aromatic rings, or functional groups with low frequency in the training set) would help in evaluating model capabilities (a descriptor-based sketch also follows after this list).

    • We were surprised that there is no trend toward predicting ligands with lower molecular weight and fewer rotatable bonds more accurately (Supp. Fig. S5), as it seems that accurate pairwise interactions should be easier to compute with fewer degrees of freedom. We think it would be valuable to discuss this point in more detail.
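
    To illustrate the kind of ligand-positioning analysis we have in mind, rigid-body placement error can be separated from conformational error by comparing a symmetry-aware RMSD computed in place with the RMSD after best-fit superposition of the ligand alone. The sketch below uses RDKit and assumes the predicted and reference complexes have already been superposed on the binding pocket; it is our illustration, not the authors' evaluation code.

    ```python
    # Minimal sketch (not the paper's pipeline): separate rigid-body placement error
    # from conformational error for a predicted ligand pose. Assumes `pred` and `ref`
    # are RDKit molecules with 3D conformers in the same (pocket-superposed) frame.
    from rdkit import Chem
    from rdkit.Chem import rdMolAlign

    def decompose_pose_error(pred: Chem.Mol, ref: Chem.Mol) -> dict:
        # Symmetry-aware RMSD without moving the ligand: captures translation,
        # rotation and conformational differences together.
        inplace_rmsd = rdMolAlign.CalcRMS(Chem.Mol(pred), ref)

        # Symmetry-aware RMSD after best-fit superposition of the ligand alone:
        # only internal (conformational) differences remain.
        conformational_rmsd = rdMolAlign.GetBestRMS(Chem.Mol(pred), ref)

        # The gap between the two is attributable to rigid-body mis-placement
        # (simple rotations/translations of an essentially correct conformer).
        return {
            "inplace_rmsd": inplace_rmsd,
            "conformational_rmsd": conformational_rmsd,
            "rigid_body_gap": inplace_rmsd - conformational_rmsd,
        }
    ```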
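
    For the ligand-type breakdown suggested in the last two points, success rates could be stratified by simple RDKit descriptors. The sketch below assumes a hypothetical results table with one SMILES string and one success flag per system; the file and column names are ours, not from the paper.

    ```python
    # Minimal sketch: stratify co-folding success rates by ligand descriptors.
    # "benchmark_results.csv" and its columns ("smiles", "success") are hypothetical.
    import pandas as pd
    from rdkit import Chem
    from rdkit.Chem import Descriptors, Lipinski

    def ligand_features(smiles: str) -> dict:
        mol = Chem.MolFromSmiles(smiles)
        return {
            "mol_wt": Descriptors.MolWt(mol),
            "n_rotatable": Lipinski.NumRotatableBonds(mol),
            "n_aromatic_rings": Lipinski.NumAromaticRings(mol),
            "logp": Descriptors.MolLogP(mol),  # crude hydrophobicity proxy
        }

    df = pd.read_csv("benchmark_results.csv")
    df = pd.concat([df, pd.DataFrame([ligand_features(s) for s in df["smiles"]])], axis=1)

    # Success rate per rotatable-bond bin; the same groupby works for any descriptor.
    rot_bins = pd.cut(df["n_rotatable"], bins=[-1, 3, 6, 10, 100])
    print(df.groupby(rot_bins)["success"].mean())
    ```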

    Minor Points

    • In the discussion on page 9, the authors mention testing systems containing the same protein and ligand but with multiple ligand binding modes. We look forward to the inclusion of this experiment, and also think that it would be valuable to include tests of multiple similar ligands (e.g. from fragment screening data) binding in different modes to the same protein pocket.

    • We noticed some mistakes in figure references and captions:

      • In paragraph 2 of the Results: "Figure 1 shows the common subset of obtained predictions across all four methods", but this is not part of the figure. This paragraph also references Figure S2, which does not show that information either.

      • There is a reference to Figure 2C on page 6, which is not a panel in Figure 2.

      • Panels referenced in the Figure 3 caption for the last similarity bin (80-100) are mislabelled as (F-G); they should be (G-H).

      • Many links to supplemental figures and tables direct to the non-supplemental figure of the same number.

    • We think it would be useful to provide a comparison table between Runs N' Poses and existing benchmarks like PoseBusters and PLINDER.

    • An additional discussion of chirality mismatches in predictions would be interesting: are they generally caused by predicting the pose of a training-set ligand, as shown in the one example, or are chirality errors generally stochastic?

    • The AF3 validation results do not substantially address increasing the number of recycles [1], which was demonstrated to increase success rates for AF2. Does increasing the number of recycles in AF3-like models meaningfully improve performance on this benchmark?

    • Evaluating physics-based folding and docking systems as outlined in the Limitations and Future plans section would be an excellent addition to the paper.

    • We have a few comments on the appendix on iPTM:

      • We think presenting the confusion matrix at a standard fixed cutoff (e.g. 0.8) would be more informative for the typical use of iPTM. Because of the differences between the models, there is no guarantee that the optimal iPTM thresholds found here will generalize to structures outside of this benchmark (a sketch follows after this list).

      • Confidence-module architectures differ significantly between some of the models (particularly Boltz-1, which uses a full trunk architecture for its confidence module). We think it would be useful to discuss this and its implications, especially considering that Boltz-1's confidence at the presented threshold achieves a high recall.

      • It would be useful to mention why the iPTM matrices are asymmetric for Chai-1 and Boltz-1. In a remark from the Boltz-1 authors, we heard that this is because Boltz-1 aggregates the PAE, which is an asymmetric matrix (shown in Algorithm 31 of the AF3 supplement), directionally across the interface. It may be that AF3 and Protenix only compute a single direction or take the maximum over both directions to obtain a symmetric matrix.
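
    To make the fixed-cutoff and symmetrisation points above concrete, the sketch below shows a confusion matrix at iPTM ≥ 0.8 and a max-over-directions symmetrisation of a chain-pair score matrix. It is plain numpy with illustrative variable names and is not tied to any model's actual output format.

    ```python
    # Minimal sketch: (i) confusion matrix of pose success vs a fixed iPTM cutoff,
    # (ii) symmetrising a directional chain-pair score (e.g. PAE aggregated A->B vs B->A).
    import numpy as np

    def confusion_at_cutoff(iptm: np.ndarray, success: np.ndarray, cutoff: float = 0.8):
        """iptm: per-prediction interface confidence; success: boolean pose correctness."""
        predicted_good = iptm >= cutoff
        tp = int(np.sum(predicted_good & success))
        fp = int(np.sum(predicted_good & ~success))
        fn = int(np.sum(~predicted_good & success))
        tn = int(np.sum(~predicted_good & ~success))
        # rows: predicted good / predicted bad, columns: actually good / actually bad
        return np.array([[tp, fp], [fn, tn]])

    def symmetrise(chain_pair_score: np.ndarray) -> np.ndarray:
        """Collapse an asymmetric chain-pair matrix into a symmetric one by taking
        the element-wise maximum over both directions (one plausible convention)."""
        return np.maximum(chain_pair_score, chain_pair_score.T)
    ```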

    Competing interests

    The authors declare that they have no competing interests.

  2. The future of benchmarking for deep learning methods, as we move toward more complex multidimensional tasks such as co-folding, requires different measures to assess leakage and difficulty, whether for protein-protein interactions (PPIs), protein-ligand interactions (PLIs), or protein-nucleotide complexes.

    Is one possible solution (or at least a mitigation strategy) to include some kind of confidence score alongside PLI predictions?

  3. This is also particularly crucial for tasks with even more limited data, such as those involving covalent bonds and modified residues.

    Does it follow from this that training PLMs on smaller, more uniformly sampled, and more diverse datasets could lead to better generalization? Or is this overfitting an unavoidable feature of large models?

  4. We demonstrate that the performance of current approaches strongly correlates with the similarity to their training data, regardless of the metric used to define success or the subsets considered.

    Is there a metric that subsequent models could report to give users insight into how problematic overfitting is for a particular prediction? Something that takes training-set similarity into account (a rough ligand-level sketch of one such signal follows below).
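
    One rough, ligand-level version of such a signal is the maximum Tanimoto similarity of a query ligand to the ligands in the training set, sketched below with RDKit Morgan fingerprints. This is only an illustration of the idea; the pocket-level similarity used in the paper additionally requires structural comparison.

    ```python
    # Minimal sketch: maximum fingerprint similarity of a query ligand to training
    # ligands, as a cheap "how novel is this ligand?" signal. Names are illustrative.
    from rdkit import Chem, DataStructs
    from rdkit.Chem import rdFingerprintGenerator

    morgan = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

    def max_training_similarity(query_smiles: str, training_smiles: list[str]) -> float:
        query_fp = morgan.GetFingerprint(Chem.MolFromSmiles(query_smiles))
        train_fps = [morgan.GetFingerprint(Chem.MolFromSmiles(s)) for s in training_smiles]
        return max(DataStructs.BulkTanimotoSimilarity(query_fp, train_fps))

    # Example: a ligand identical to one seen in training returns 1.0.
    print(max_training_similarity("CCO", ["CCO", "c1ccccc1"]))
    ```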