A hybrid machine learning framework leveraging biophysicochemical insights for scalable discovery of protein-ligand interactions


Abstract

Improving the accuracy of in silico compound-protein interaction (CPI) prediction is critical for productive drug discovery. Current deep learning approaches largely rely on end-to-end models trained on limited labeled CPI datasets, overlooking the representational power of large-scale biochemical foundation models. We present COMRADE (Contrastive Multirepresentation Accelerated Docking Engine), a hybrid virtual screening framework that accelerates docking by triaging compounds using CE-Screen (Contrastive Embedding-Screen). CE-Screen leverages seven high-dimensional pretrained representations – including those from protein language models and molecular transformers, along with an original physics-based interaction potential encoding – for rapid first-pass screening ∼100× faster than docking. Its contrastive compression neural network maps these inputs onto a single compact, discriminative representation, which a lightweight ensemble classifier then uses for CPI prediction. CE-Screen outperforms state-of-the-art end-to-end models by up to 111.11% on retrospective benchmarks and is successfully used to triage ∼10.8 million compounds against five targets, yielding novel hits for each one – including a new scaffold for the branched-chain ketoacid dehydrogenase kinase (BCKDK), an understudied yet high-value target in metabolic disease and oncology.
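
The screen-then-dock triage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `triage_then_dock` and `relative_cost`, the `keep_fraction` parameter, and the scoring callables are hypothetical, and the default speedup of 100 simply mirrors the ∼100× figure quoted in the abstract. The sketch ranks every compound with a cheap first-pass score, sends only the top fraction to the expensive docking step, and estimates the resulting cost relative to docking the whole library.

```python
from typing import Callable, List, Sequence, Tuple


def triage_then_dock(
    compounds: Sequence[str],
    fast_score: Callable[[str], float],
    dock: Callable[[str], float],
    keep_fraction: float = 0.01,  # assumed value, not taken from the paper
) -> List[Tuple[str, float]]:
    """Rank all compounds with a cheap score, then dock only the top fraction."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Higher fast_score is assumed to mean "more likely to bind".
    ranked = sorted(compounds, key=fast_score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    shortlist = ranked[:n_keep]
    # Only the shortlist pays the cost of the expensive docking call.
    return [(compound, dock(compound)) for compound in shortlist]


def relative_cost(keep_fraction: float, speedup: float = 100.0) -> float:
    """Cost of screen-then-dock relative to docking every compound (= 1.0).

    Screening the full library costs 1/speedup; docking the shortlist
    costs keep_fraction.
    """
    return 1.0 / speedup + keep_fraction


if __name__ == "__main__":
    # Toy stand-ins for a learned screen and a docking program.
    toy_scores = {"cpd_a": 0.9, "cpd_b": 0.2, "cpd_c": 0.7, "cpd_d": 0.1}
    hits = triage_then_dock(
        list(toy_scores),
        fast_score=toy_scores.get,
        dock=lambda c: -10.0 * toy_scores[c],  # fake docking energy
        keep_fraction=0.5,
    )
    print(hits)
    print(relative_cost(keep_fraction=0.01))
```

Under these assumptions, the total cost is dominated by whichever term is larger: the full-library screen (1/speedup) or the docked shortlist (`keep_fraction`). Shrinking the shortlist below 1/speedup therefore gives diminishing returns.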
