How many crystal structures do you need to trust your docking results?
Abstract
Structure-based drug discovery technologies generally require predicting putative bound poses of protein:small molecule complexes in order to prioritize compounds for synthesis. The predicted structures feed a variety of downstream tasks, serving as input to pose-scoring functions or as starting points for binding free energy estimation. The accuracy of downstream models depends on how well predicted poses match experimentally validated poses. Although the ideal input to these downstream tasks would be experimental structures, the time and cost required to collect new experimental structures for synthesized compounds make obtaining a structure for every input intractable. Thus, leveraging available structural data is required to extrapolate efficiently to new designs. Using data from the open science COVID Moonshot project, in which nearly every synthesized compound was crystallographically screened, we assess several popular strategies for generating docked poses in a structure-enabled discovery program using both retrospective and prospective analyses. We explore the tradeoff between the cost of obtaining crystal structures and their utility for accurately predicting poses of newly designed molecules. We find that a simple strategy using molecular similarity to identify relevant structures for template-guided docking successfully predicts poses for the SARS-CoV-2 main viral protease. Further efficiency analysis suggests that template-based docking of a scaffold series is a robust strategy even when the quantity of available structural data is limited. The resulting open source pipeline and curated datasets should prove useful for automated modeling of bound poses for downstream scoring, machine learning, and free energy calculation tasks in structure-based drug discovery programs.
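As a rough illustration of the similarity-based template selection described above, the sketch below picks the most similar reference ligand for a query using the Tanimoto coefficient. It is not the authors' pipeline: the fingerprints are toy sets of integer feature identifiers, and the names `tanimoto`, `pick_template`, `ligand_A`, and `ligand_B` are hypothetical. In practice the fingerprints would come from a cheminformatics toolkit, and the chosen reference structure would then guide the docking step.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Assumption: each molecule is represented by a set of integer feature IDs
# (a stand-in for a real molecular fingerprint).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared features divided by total distinct features."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def pick_template(
    query: Fingerprint, references: Dict[str, Fingerprint]
) -> Optional[Tuple[str, float]]:
    """Return (name, similarity) of the most similar reference, or None if empty."""
    best: Optional[Tuple[str, float]] = None
    for name, fingerprint in references.items():
        score = tanimoto(query, fingerprint)
        if best is None or score > best[1]:
            best = (name, score)
    return best


# Toy data: two ligands with known crystal structures (hypothetical names).
references = {
    "ligand_A": frozenset({1, 2, 3, 4}),
    "ligand_B": frozenset({3, 4, 5, 6, 7}),
}
query = frozenset({1, 2, 3, 9})

best_template = pick_template(query, references)
print(best_template)
```

Here `ligand_A` shares three of five distinct features with the query, so it would be selected as the template. A real workflow would likely also apply a minimum similarity threshold before trusting a template.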