A hybrid machine learning framework leveraging biophysicochemical insights for scalable discovery of protein-ligand interactions
Abstract
Improving in silico compound-protein interaction (CPI) predictability is critical for productive drug discovery. Current deep learning approaches largely rely on end-to-end models trained on limited labeled CPI datasets, overlooking the representational power of large-scale biochemical foundation models. We present COMRADE (Contrastive Multirepresentation Accelerated Docking Engine), a hybrid virtual screening framework that accelerates docking by triaging compounds using CE-Screen (Contrastive Embedding-Screen). CE-Screen leverages seven high-dimensional pretrained representations – including those from protein language models and molecular transformers, along with an original physics-based interaction potential encoding – for rapid first-pass screening ∼100× faster than docking. Its contrastive compression neural network maps these inputs onto a single compact, discriminative representation optimized for CPI prediction via a lightweight ensemble classifier. CE-Screen outperforms state-of-the-art end-to-end models by up to 111.11% on retrospective benchmarks and is successfully used to triage ∼10.8 million compounds against five targets, yielding novel hits for each one – including a new scaffold for the branched-chain ketoacid dehydrogenase kinase (BCKDK), an understudied yet high-value target in metabolic disease and oncology.
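The contrastive compression step described above can be illustrated with a minimal sketch. The abstract does not specify the network architecture or the loss, so everything here is an assumption made for illustration: the helper names `compress` and `supervised_contrastive_loss`, the single tanh projection layer, the temperature value, and the use of a supervised contrastive (SupCon-style) objective. The sketch concatenates several pretrained representation blocks for each compound-protein pair, projects them to a compact unit-norm embedding, and scores how well pairs with the same interaction label cluster together.

```python
import numpy as np


def compress(blocks, weights):
    """Concatenate representation blocks and project them to a compact unit-norm embedding.

    blocks  : list of arrays, each (n_pairs, d_i), one per pretrained representation
    weights : array (sum of d_i, d_out), a hypothetical single projection layer
    """
    x = np.concatenate(blocks, axis=1)            # (n_pairs, sum of d_i)
    z = np.tanh(x @ weights)                      # (n_pairs, d_out)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / (norms + 1e-12)                    # rows scaled to unit length


def supervised_contrastive_loss(z, labels, temperature=0.1):
    """SupCon-style loss: same-label embeddings are pulled together, others pushed apart.

    z      : array (n, d) of unit-norm embeddings, n >= 2
    labels : array (n,) of class labels (e.g. 1 = interacting, 0 = non-interacting)
    """
    n = z.shape[0]
    sim = (z @ z.T) / temperature
    not_self = ~np.eye(n, dtype=bool)

    # Log-sum-exp over all other samples; each anchor is excluded from its own denominator.
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(
        np.exp(sim_masked - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denominator

    # Positives are the other samples that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0                           # skip anchors with no positive partner
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / n_pos[has_pos]
    return float(per_anchor.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two stand-in representation blocks (random placeholders, not real embeddings).
    protein_block = rng.normal(size=(6, 16))
    ligand_block = rng.normal(size=(6, 12))
    weights = rng.normal(size=(28, 4))
    labels = np.array([1, 1, 1, 0, 0, 0])

    z = compress([protein_block, ligand_block], weights)
    print("embedding shape:", z.shape)
    print("contrastive loss:", supervised_contrastive_loss(z, labels))
```

In a real pipeline the projection weights would be trained by minimizing this loss, and the resulting compact embeddings would then be passed to a downstream classifier; the random inputs here only demonstrate the data flow.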