Deep contrastive feature compression with classical machine learning enables ligand discovery through efficient triage of large chemical libraries
Abstract
Improving in silico compound-protein interaction (CPI) prediction is critical for productive drug discovery. Current deep learning approaches largely rely on end-to-end models trained on limited labeled CPI data, overlooking preexisting, specialized compound and protein input representations. We present Ligand Extra trees-Accelerated Docking (LEAD), a virtual screening framework that accelerates docking by adding a rapid first-pass CPI prediction step, ET-Screen. Unlike end-to-end models, ET-Screen uses seven distinct representations, including embeddings from large-scale protein language models and molecular transformers, as well as an original CPI potential fingerprint. ET-Screen’s contrastive compression neural network maps the consolidated 2,457-dimensional compound-protein representation into a compact, discriminative form optimized for CPI classification by an ensemble of decision trees. ET-Screen outperforms state-of-the-art end-to-end approaches by up to 23.39% on diverse retrospective benchmarks while being ∼100× faster than standard-precision docking. Its speed enables triaging ∼10.8 million prospective drug candidates across five targets, reducing them to 10,000 for docking within the LEAD framework and ultimately yielding novel, experimentally validated hits.
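The triage step described above, in which a fast scorer ranks every candidate and only a shortlist goes on to docking, can be illustrated with a minimal sketch. This is not the authors' implementation: `triage` and `toy_score` are hypothetical names, the scorer is a deterministic random stand-in for a real CPI classifier, and the library is scaled down a thousandfold so the example runs instantly.

```python
import heapq
import random
from typing import Callable, Iterable, List, Tuple


def triage(
    candidates: Iterable[str],
    fast_score: Callable[[str], float],
    keep: int,
) -> List[Tuple[float, str]]:
    """Return the `keep` highest-scoring candidates, best first.

    heapq.nlargest streams over the input and holds only `keep` items
    at a time, which matters when the library has millions of entries.
    """
    scored = ((fast_score(c), c) for c in candidates)
    return heapq.nlargest(keep, scored)


def toy_score(compound_id: str) -> float:
    """Hypothetical stand-in for a fast CPI classifier's probability."""
    rng = random.Random(compound_id)  # deterministic per compound ID
    return rng.random()


# Scaled-down library (assumption for illustration only); the abstract's
# figures are roughly 10.8 million candidates reduced to 10,000.
library = (f"cmpd-{i}" for i in range(10_800))
shortlist = triage(library, toy_score, keep=10)

# Fold reduction implied by the abstract's figures.
reduction = 10_800_000 / 10_000
print(len(shortlist), reduction)
```

Only the shortlist would be passed to the slower docking stage. Using the abstract's figures, the first pass shrinks the candidate pool by roughly three orders of magnitude before any docking is run.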