Utilizing Data Imbalance to Enhance Compound–Protein Interaction Prediction Models
Abstract
Identifying potential compounds for target proteins is crucial in drug discovery. Current compound–protein interaction prediction models rely on complex features to enhance their capabilities, but this often incurs a substantial computational burden. This challenge arises from a limited understanding of the data imbalance between proteins and compounds, which leads to insufficient optimization of protein encoders. To address this issue, a sequence‐based predictor named FilmCPI is introduced, which leverages data imbalance by learning each protein together with its numerous corresponding compounds. This approach enables the characteristics of each protein to be represented effectively through both the protein itself and its relationships with its corresponding compounds. Without increasing the number of parameters, FilmCPI consistently outperforms baseline models across diverse datasets and split strategies, and its generalization to unseen proteins becomes more pronounced as the datasets expand. Notably, FilmCPI can be effectively transferred to unseen membrane protein families using sequence‐based data from other families. The optimization dynamics are further analyzed, revealing that the effectiveness of FilmCPI is attributable to the different optimization speeds of the different encoders. Overall, this work aims to provide a theoretical perspective for designing efficient models.