CPIExtract: A software package to collect and harmonize small molecule and protein interactions

Abstract

Summary

Binding interactions between small molecules and proteins underlie cellular function. Yet the available experimental data on compound-protein interactions (CPIs) are not harmonized into a single resource; they are scattered across multiple institutions, each maintaining databases in a different format. Extracting information from these sources remains challenging because of this heterogeneity. Here, we present CPIExtract (Compound-Protein Interaction Extract), a Python package that automatically retrieves CPI data from nine major repositories, filters out non-human and low-quality records, harmonizes chemical and protein identifiers, and computes unified pChEMBL binding values. Compared with MINER, a state-of-the-art CPI extraction algorithm, CPIExtract retrieves 85.5% more compounds, 16-fold more experimentally supported interactions, and over four times more proteins, substantially increasing the availability of both strong and weak binders. The resulting harmonized dataset supports custom filtering and export in standard tabular formats for downstream applications such as network medicine, drug repurposing, and the training of deep learning models.
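To make the notion of a unified pChEMBL value concrete, the sketch below applies the standard definition, the negative base-10 logarithm of a molar activity value (IC50, Ki, Kd, and similar), to a few made-up records. It is not CPIExtract's API: the function names, the record layout, the identifiers, and the choice to average repeated measurements per compound-protein pair are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Multipliers that convert common concentration units to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_pchembl(value, unit="nM"):
    """Return -log10 of an activity value after converting it to mol/L."""
    if value <= 0:
        raise ValueError("activity value must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


def harmonize(records):
    """Average pChEMBL over repeated measurements of each pair.

    Averaging is an assumption of this sketch, not necessarily the
    aggregation rule used by CPIExtract.
    """
    grouped = defaultdict(list)
    for rec in records:
        key = (rec["compound"], rec["protein"])
        grouped[key].append(to_pchembl(rec["value"], rec["unit"]))
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


# Hypothetical records with invented identifiers, as they might look
# after retrieval from sources that report different units.
records = [
    {"compound": "CPD-1", "protein": "PROT-A", "value": 10.0, "unit": "nM"},
    {"compound": "CPD-1", "protein": "PROT-A", "value": 0.1, "unit": "uM"},
    {"compound": "CPD-2", "protein": "PROT-A", "value": 50.0, "unit": "uM"},
]

unified = harmonize(records)

# Illustrative cut-off only: keep pairs at or above pChEMBL 6 (1 uM).
strong = {pair: p for pair, p in unified.items() if p >= 6.0}
```

Because pChEMBL is logarithmic, a difference of one unit corresponds to a tenfold difference in concentration, which is why values reported in different units must be converted to mol/L before they can be compared or filtered with a single threshold.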

Availability

CPIExtract is an open-source Python package released under the MIT license. It can be downloaded from https://github.com/menicgiulia/CPIExtract and https://pypi.org/project/cpiextract. The package runs on any standard desktop computer or computing cluster.
