Establishing the foundations for a data-centric AI approach for virtual drug screening through a systematic assessment of the properties of chemical data

Allen Chong
Ser-Xian Phua
Yunzhi Xiao
Woon Yee Ng
Hoi Yeung Li
Wilson Wen Bin Goh

Curated by eLife

eLife Assessment

This study reports valuable findings that highlight the importance of data quality and data representation for ligand-based virtual screening experiments. The authors' claims are supported by solid evidence, although the conclusions have been inferred from only two datasets. The work would gain much impact if additional datasets were used. The main findings will be of interest to cheminformaticians and medicinal chemists working in QSAR modeling, and possibly in other areas related to machine learning.

Abstract

Summary

Researchers have adopted model-centric artificial intelligence (AI) approaches in cheminformatics by using newer, more sophisticated AI methods to take advantage of growing chemical libraries. It has been shown that complex deep learning methods outperform conventional machine learning (ML) methods in QSAR and ligand-based virtual screening [1–3], but such approaches generally lack explainability. Hence, instead of developing more sophisticated AI methods (i.e., pursuing a model-centric approach), we wanted to explore the potential of a data-centric AI paradigm for virtual screening. A data-centric AI is an intelligent system that would automatically identify the right type of data to collect, clean and curate for later use by a predictive AI. Such a system is required given the large volumes of data that exist in chemical databases: PubChem alone has over 100 million unique compounds. However, a systematic assessment of the attributes and properties of suitable data is needed. We show here that it is not deficiencies in current AI algorithms but rather poor understanding and erroneous use of chemical data that ultimately lead to poor predictive performance. Using a new benchmark dataset of BRAF ligands that we developed, we show that our best performing predictive model can achieve an unprecedented accuracy of 99% with a conventional ML algorithm (SVM) using a merged molecular representation (Extended + ECFP6 fingerprints), far surpassing past performances of virtual screening platforms using sophisticated deep learning methods. Thus, we demonstrate that it is not necessary to resort to sophisticated deep learning algorithms for virtual screening, because conventional ML can perform exceptionally well if given the right data and representation. We also show that the common use of decoys for training leads to high false positive rates, and that their use for testing results in an over-optimistic estimate of a model's predictive performance.
Another common practice in virtual screening is defining compounds that are above a certain pharmacological threshold as inactives. Here, we show that the use of these so-called inactive compounds lowers a model's sensitivity/recall. Considering that some target proteins have a limited number of known ligands, we also wanted to observe how the size and composition of the training data impact predictive performance. We found that an imbalanced training dataset in which inactives outnumber actives led to a decrease in recall but an increase in precision, regardless of the model or molecular representation used; overall, we observed a decrease in the model's accuracy. We highlight in this study some of the considerations that one needs to take into account in the future development of data-centric AI for computer-aided drug design (CADD).
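
To make the idea of a merged molecular representation concrete, here is a minimal sketch. It assumes that "merging" two fingerprints means concatenating their bit vectors; the abstract does not spell out the procedure. The toy vectors and the function name `merge_fingerprints` are illustrative only and are not taken from the study.

```python
from typing import List, Sequence


def merge_fingerprints(fp_a: Sequence[int], fp_b: Sequence[int]) -> List[int]:
    """Concatenate two binary fingerprints into one feature vector.

    Assumption: "merged" means simple concatenation. Real fingerprints
    would be much longer and come from a cheminformatics toolkit.
    """
    for bit in (*fp_a, *fp_b):
        if bit not in (0, 1):
            raise ValueError("fingerprints must contain only 0/1 bits")
    return [*fp_a, *fp_b]


# Toy 4-bit and 6-bit vectors standing in for two fingerprint types.
extended_fp = [1, 0, 0, 1]
ecfp6_fp = [0, 1, 1, 0, 0, 1]
merged = merge_fingerprints(extended_fp, ecfp6_fp)
print(len(merged), merged)
```

Under this assumption, the merged vector simply exposes both sets of features to the classifier at once, so the classifier itself needs no changes.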

Version published to 10.7554/elife.97821.2 on eLife
Dec 17, 2024
Version published to 10.7554/elife.97821 on eLife
Dec 17, 2024
eLife
Dec 16, 2024

Reviewer #1 (Public review):

Summary:

The work provides more evidence of the importance of data quality and representation for ligand-based virtual screening approaches. The authors have applied different machine learning (ML) algorithms and data representations to a new dataset of BRAF ligands. First, the authors evaluate the ML algorithms and demonstrate that, independently of the ML algorithm, predictive and robust models can be obtained with this BRAF dataset. Second, the authors investigate how the molecular representations can modify the prediction of the ML algorithm. They found that in this highly curated dataset the different molecular representations are adequate for the ML algorithms, since almost all of them obtain high accuracy values, with E-State fingerprints yielding the worst-performing predictive models and ECFP6 fingerprints producing the best classification models. Third, the authors evaluate the performance of the models on subsets of the BRAF dataset that differ in composition and size. They found that, given a finite number of active compounds, increasing the number of inactive compounds worsens the recall and accuracy. Finally, the authors analyze whether the use of "less active" molecules affects the model's predictive performance, using either "less active" molecules taken from the ChEMBL database or decoys from DUD-E. They found that the accuracy of the model falls as the number of "less active" examples in the training dataset increases, while the use of decoys in the training set generates results as good as the original models or even better in some cases. However, the use of decoys in the training set worsens the predictive power on test sets that contain active and inactive molecules.
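
For readers less familiar with the metrics named in this summary, the sketch below computes accuracy, precision and recall from confusion-matrix counts. The counts are invented purely for illustration and are not results from the study; they are chosen so that the imbalanced case shows the reported pattern of lower recall, higher precision and lower accuracy.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts: 500 actives vs 500 inactives (balanced) ...
balanced = classification_metrics(tp=480, fp=20, tn=480, fn=20)
# ... and 500 actives vs 3600 inactives (imbalanced).
imbalanced = classification_metrics(tp=300, fp=2, tn=3598, fn=200)

for name, metrics in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```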

Strengths:

This is a highly relevant topic in medicinal chemistry and drug discovery. The manuscript is well-written, with a clear structure that facilitates easy reading, and it includes up-to-date references. The hypotheses are clearly presented and appropriately explored. The study provides valuable insights into the importance of deriving models from high-quality data, demonstrating that, when this condition is met, complex computational methods are not always necessary to achieve predictive models. Furthermore, the generated BRAF dataset offers a valuable resource for medicinal chemists working in ligand-based virtual screening.

Weaknesses:

While the work highlights the importance of using high-quality datasets to achieve better and more generalizable results, it does not present significant novelty, as the analysis of training data has been extensively studied in chemoinformatics and medicinal chemistry. Additionally, the inclusion of "AI" in the context of data-centric AI is somewhat unclear, given that the dataset curation is conducted manually, selecting active compounds based on IC50 values from ChEMBL and inactive compounds according to the authors' criteria.

Moreover, the conclusions are based on the analysis of only two high-quality datasets. To generalize these findings, it would be beneficial to extend the analysis to additional high-quality datasets (at least 10 datasets for a robust benchmarking exercise).

A key aspect that could be improved is the definition of an "inactive" compound, which remains unclear. In the manuscript, it is stated:

• "The inactives were carefully selected based on the fact that they have no known pharmacological activity against BRAF."
Does the lack of BRAF activity data necessarily imply that these compounds are inactive?
• "We define a compound as 'inactive' if there are no known pharmacological assays for the said compound on our target, BRAF."
However, in the authors' response, they mention:
• "We selected certain compounds that we felt could not possibly be active against BRAF, such as ligands for neurotransmitter receptors, as inactives."

Given that the definition of "inactive" is one of the most critical concepts in the study, I believe it should be clearly and consistently explained.

Lastly, while statistical comparison is not always common in machine learning, it would greatly enhance the value of this work, especially when comparing models with small differences in accuracy.
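
One possible form such a comparison could take is an exact McNemar test on the paired predictions of two models over the same test set. The review does not name a specific test, so this choice is an assumption, as are the function names and the toy labels below.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(y_true: Sequence[int],
                      pred_a: Sequence[int],
                      pred_b: Sequence[int]) -> Tuple[int, int]:
    """Count test compounds where exactly one of the two models is correct."""
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy labels (1 = active, 0 = inactive); not data from the study.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
model_a = [1, 1, 0, 0, 0, 1, 1, 0]
model_b = [1, 0, 0, 0, 1, 1, 1, 0]
b, c = discordant_counts(y_true, model_a, model_b)
print(b, c, mcnemar_exact(b, c))
```

Because the test uses only the compounds on which the two models disagree, it remains informative even when both models have accuracies very close to each other.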

Reviewer #2 (Public review):

Summary:

The authors explored the importance of data quality and representation for ligand-based virtual screening approaches. I believe the results could be of potential benefit to the drug discovery community, especially to those scientists working in the field of machine learning applied to drug research. The in silico design is comprehensive and adequate for the proposed comparisons.

This manuscript by Chong A. et al. argues that it is not necessary to resort to sophisticated deep learning algorithms for virtual screening, since their results show that conventional ML may perform exceptionally well if fed the right data and molecular representations.

The article is interesting and well-written. The overview of the field and the warning about dataset composition are very well thought-out and should be of interest to a broad segment of the AI in drug discovery readership. This article further highlights some of the considerations that need to be taken into account for the implementation of data-centric AI for computer-aided drug design methods.

Strengths:

This study contributes significantly to the field of machine learning and data curation in drug discovery. The paper is, in general, well-written and structured. However, in my opinion, there are some suggestions regarding certain aspects of the data analyses.

Weaknesses:

The conclusions drawn in the study are based on the analysis of only two datasets. The authors chose BRAF as an example in this study and extended it with a BACE-1 dataset; however, a benchmark with several targets would be more suitable for evaluating the reproducibility or transferability of the method. One concern is the applicability of the method to other targets.

Reviewer #3 (Public review):

Summary:

The authors presented a data-centric ML approach for virtual ligand screening. They used BRAF as an example to demonstrate the predictive power of their approach.

Strengths:

The performance of the predictive models in this study is superior (nearly perfect) to that of existing methods.

Comments on revisions:

In the revised manuscript, the presented approach has been robustly tested and can be very useful for ligand prediction.

Author response:

The following is the authors’ response to the original reviews.

We thank the Editors and reviewers for their candid evaluation of our work. It was suggested that we should demonstrate the validity of our approach with perhaps 10 different datasets, but we felt that this would place an undue burden on our resources. Generally, it takes about 4 to 6 months for us to build a dataset, and this does not include the time taken to train and test our AI models. This would mean that it would take us another 3 to 5 years to complete this research project if we chose to provide 10 different datasets. Publishing research on a single dataset is definitely not unheard of: for example, Subramanian et al. (2016) published their widely-cited benchmark dataset for just BACE1 inhibitors. However, we hoped that the additional work, in which we showed that we were able to improve the benchmark dataset for BACE1 inhibitors and achieve the same high level of predictive performance for this dataset, would convince the readers (and reviewers) of the reproducibility of our approach. Furthermore, we also showed that our approach is robust and does not rely on a large volume of data to achieve this near-perfect accuracy. As can be seen in the Supplemental section, even our AI models trained on ONLY 250 BRAF actives and 250 inactives could achieve 96.3% accuracy! Logically, if the model is robust then we would expect the model to be reproducible. As such, we do not feel it is necessary for us to test our approach on 10 different datasets.

It was also suggested that we expand this study to other types of molecular representations to give a better idea of generalizability. We would like to point out that we tested, in total, 55 single fingerprints and paired combinations. Our goal was to create an approach that could give superior performance for virtual screening and we believe that we have achieved this. Based on the results of our study, we are of the opinion that molecular representations do not, in general, have an oversized effect on AI virtual screening. It is important to be aware that certain molecular representations may give SLIGHTLY better performance, but we can see that, with the exception of the 79-bit E-State fingerprint (which could still achieve an impressive 85% accuracy for the SVM model), nearly all molecular fingerprints and paired combinations that we used were able to achieve an accuracy of above 97%. Therefore, we do not share the reviewers' concern that our approach may not be useful when applied with other types of molecular representations.
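
The figure of 55 is consistent with, for example, 10 individual fingerprints plus every unordered pair of them (10 + 45). The response does not state the breakdown, so the sketch below rests on that assumption, and the fingerprint names are placeholders.

```python
from itertools import combinations

# Placeholder names; the actual fingerprint set is not listed in the response.
fingerprints = [f"fp_{i}" for i in range(1, 11)]

singles = [(fp,) for fp in fingerprints]
pairs = list(combinations(fingerprints, 2))  # unordered pairs, no repeats
representations = singles + pairs

print(len(singles), len(pairs), len(representations))  # 10 45 55
```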

It is true that our work involved manual curation of the datasets, but the goal of this paper is to lay down some ground rules for the future development of a data-centric AI approach. Manual curation is a routine practice in AI/ML, but it should be recognised that there is good manual curation and bad manual curation, and rules need to be established to ensure we have good manual curation. Without these rules, we would also not be able to establish and train a data-centric AI. All manual curation involves a level of subjectivity, but that subjectivity comes from one's experience and domain knowledge of the field in which the AI is being applied. For example, in the case of this study, we relied on our knowledge and understanding of pharmacology to determine whether a compound is pharmacologically inactive or active. This may seem somewhat arbitrary to the uninitiated but it is anything but arbitrary. It is through careful thought and assessment of the chemical compounds that we choose these compounds for training the AI. Unfortunately, this sort of subjective assessment cannot be easily or completely explained, but we do show where current practices have failed when building a dataset for training an AI for virtual screening.

Version published to 10.7554/elife.97821.1 on eLife
Jun 14, 2024
eLife
Jun 13, 2024

eLife assessment

This study presents a valuable finding on how data quality and data representation are key to obtaining predictive machine learning models, even without resorting to complex machine learning approaches. The evidence supporting the claims of the authors is, however, incomplete, as their conclusions are drawn from a single large dataset, similarity analysis within and between subsets is lacking, and there are concerns regarding the composition of the training and holdout sets (active:inactive ratio, possible triviality of decoys). If the results were expanded to other quality datasets of different compositions to demonstrate robustness, the manuscript would be of wide interest in the machine learning and drug discovery fields.

Reviewer #1 (Public Review):

Summary:

The work provides more evidence of the importance of data quality and representation for ligand-based virtual screening approaches. The authors have applied different machine learning (ML) algorithms and data representations to a new dataset of BRAF ligands. First, the authors evaluate the ML algorithms and demonstrate that, independently of the ML algorithm, predictive and robust models can be obtained with this BRAF dataset. Second, the authors investigate how the molecular representations can modify the prediction of the ML algorithm. They found that in this highly curated dataset the different molecular representations are adequate for the ML algorithms, since almost all of them obtain high accuracy values, with E-State fingerprints yielding the worst-performing predictive models and ECFP6 fingerprints producing the best classification models. Third, the authors evaluate the performance of the models on subsets of the BRAF dataset that differ in composition and size. They found that, given a finite number of active compounds, increasing the number of inactive compounds worsens the recall and accuracy. Finally, the authors analyze whether the use of "less active" molecules affects the model's predictive performance, using either "less active" molecules taken from the ChEMBL database or decoys from DUD-E. They found that the accuracy of the model falls as the number of "less active" examples in the training dataset increases, while the use of decoys in the training set generates results as good as the original models or even better in some cases. However, the use of decoys in the training set worsens the predictive power on test sets that contain active and inactive molecules.

Strengths:

It is a very interesting topic in medicinal chemistry and drug discovery. This work is very well written and contains up-to-date references. The general structure of the work is adequate, allowing easy reading. The hypotheses are clear and were explored correctly. This work provides new evidence about the importance of inferring models from high-quality data and that, if such a condition is met, it is not necessary to use complex computational methods to obtain predictive models. The generated BRAF dataset is also a valuable benchmark dataset for medicinal chemists working in ligand-based virtual screening.

Weaknesses:

Leaving aside the new curated BRAF dataset, the work lacks novelty since it is a topic widely studied in chemoinformatics and medicinal chemistry. Furthermore, the conclusions drawn here correspond to the analysis of only one high-quality dataset where the similarity between the molecules is not quantitatively assessed (maybe active and inactive molecules are very dissimilar and any ML algorithm and fingerprint could obtain good results). To generalize the conclusions, it would be fundamental to repeat the analysis with other high-quality datasets.

Some key tasks are not clearly described. For example, there is no information about the new BRAF dataset (e.g., where the molecules were obtained from, or why the inactive molecules provide better results than the "less active" molecules from ChEMBL; what differentiates them?). The definition of an "inactive" compound is not clear. It is not stated whether global or balanced accuracy was used for the imbalanced datasets. When using decoys to evaluate the models, it is important to consider that decoys were generated to be topologically different from active compounds by comparing ECFP4 fingerprints using the Tanimoto coefficient. Therefore, it is quite obvious that when fingerprints are used to characterize molecules, the models will be able to easily discriminate them. It is important to note that this is not necessarily true for models based on other molecular descriptors, since they are not used in the generation of the decoys. In some cases, the differences between accuracies are very small and there are no statistical analyses to demonstrate whether they are statistically different or not.
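
The distinction between global and balanced accuracy matters most when classes are imbalanced. As a hedged illustration, the sketch below scores a trivial classifier that labels every compound inactive on a hypothetical set of 500 actives and 3600 inactives; the counts are not results from the study.

```python
def global_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all compounds classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of recall on actives (sensitivity) and recall on inactives (specificity)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (sensitivity + specificity)


# A classifier that predicts "inactive" for everything (hypothetical counts).
counts = dict(tp=0, fp=0, tn=3600, fn=500)
print(round(global_accuracy(**counts), 3), balanced_accuracy(**counts))
```

In this example the global accuracy looks respectable even though no active is ever found, while the balanced accuracy falls to chance level, which is why reporting which of the two was used affects how the results should be read.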

Reviewer #2 (Public Review):

Summary:

The authors explored the importance of data quality and representation for ligand-based virtual screening approaches. I believe the results could be of potential benefit to the drug discovery community, especially to those scientists working in the field of machine learning applied to drug research. The in silico design is comprehensive and adequate for the proposed comparisons.

This manuscript by Chong A. et al. argues that it is not necessary to resort to sophisticated deep learning algorithms for virtual screening, since their results show that conventional ML may perform exceptionally well if fed the right data and molecular representations.

The article is interesting and well-written. The overview of the field and the warning about dataset composition are very well thought-out and should be of interest to a broad segment of the AI in drug discovery readership. This article further highlights some of the considerations that need to be taken into account for the implementation of data-centric AI for computer-aided drug design methods.

Strengths:

This study contributes significantly to the field of machine learning and data curation in drug discovery. The paper is, in general, well-written and structured. However, in my opinion, there are some suggestions regarding certain aspects of the data analyses.

Weaknesses:

The conclusions drawn in the study are based on the analysis of a single dataset, and I am not sure they can be generalized. Therefore, in my opinion, the conclusions are only partially supported by the data. To generalize the conclusions, it is imperative to conduct a benchmark with diverse datasets for different molecular targets.
The conclusions cannot be immediately extended to molecular descriptors or features different from the ones used in this study.
It is advisable to present statistical analyses to ascertain whether the observed differences in metrics are statistically significant.

Reviewer #3 (Public Review):

Summary:

The authors presented a data-centric ML approach for virtual ligand screening. They used BRAF as an example to demonstrate the predictive power of their approach.

Strengths:

The performance of the predictive models in this study is superior (nearly perfect) to that of existing methods.

Weaknesses:

I feel the training and testing datasets may not be rigorously constructed. If that is the case, the results would be significantly affected.

I have 3 major comments:

(1) The authors identified ~4100 BRAF actives, then randomly selected 3600 BRAF actives to be part of the training dataset, with the remaining 500 actives becoming part of the hold-out test set. The problem is that the authors did not evaluate the chemical similarity between the 3600 actives in the training set and the 500 actives in the test set. If some of them were similar, the test results would be very good, but partially due to information leakage. The authors should carefully examine the chemical similarity between all pairs of training and test compounds before any conclusion is made.

(2) The authors tried to explore the role of dataset size in performance, in particular what would happen when the number of actives is reduced. However, the minimal number of actives used is 500, while the number of inactives ranges from 500 to 3600. This is quite different from real applications, where the number of expected actives in the screening library would be at most 1-2% of the whole database. The authors should further reduce the number of actives (e.g., 125, 25, 5, 1) and evaluate their model's performance.

(3) The authors chose BRAF as an example in this study. BRAF is a well-studied drug target with thousands of known actives. In real applications, the target may have only a handful of known actives. The authors should try to apply their approach to a couple of other targets that have fewer known actives than BRAF, to evaluate their method's transferability.
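
The similarity check requested in comment (1) could be sketched as follows. Fingerprints are represented here as sets of on-bit indices, and each test compound is scored by its highest Tanimoto similarity to any training compound. The compound names, bit sets and the 0.5 flagging threshold are all hypothetical; a real analysis would use fingerprints computed by a cheminformatics toolkit and a threshold chosen by the analyst.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as on-bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_train_similarity(train: Dict[str, FrozenSet[int]],
                             test: Dict[str, FrozenSet[int]]) -> Dict[str, float]:
    """For each test compound, the highest similarity to any training compound."""
    return {
        name: max((tanimoto(fp, t) for t in train.values()), default=0.0)
        for name, fp in test.items()
    }


def flag_possible_leakage(similarities: Dict[str, float],
                          threshold: float) -> List[str]:
    """Names of test compounds whose nearest training neighbour meets the threshold."""
    return sorted(n for n, s in similarities.items() if s >= threshold)


# Toy fingerprints (hypothetical on-bit indices).
train = {"train_1": frozenset({1, 2, 3, 4}), "train_2": frozenset({10, 11, 12})}
test = {"test_1": frozenset({1, 2, 3, 5}), "test_2": frozenset({20, 21})}

sims = nearest_train_similarity(train, test)
print(sims, flag_possible_leakage(sims, threshold=0.5))
```

A distribution of these nearest-neighbour similarities, rather than a single cutoff, would give a fuller picture of how much the test set resembles the training set.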

Version published to 10.1101/2024.03.28.587184 on bioRxiv
Mar 31, 2024
