Utilizing data imbalance to enhance compound-protein interaction prediction models
Abstract
Identifying candidate compounds for target proteins is crucial in drug discovery. Current compound-protein interaction prediction models rely on increasingly complex features to improve performance, which often incurs a substantial computational burden. We attribute this issue to a limited understanding of the data imbalance between proteins and compounds, which leaves protein encoders insufficiently optimized. We therefore introduce FilmCPI, a sequence-based predictor designed to exploit this imbalance by learning each protein from its many corresponding compounds. FilmCPI consistently outperforms baseline models across diverse datasets and split strategies, and its generalization to unseen proteins becomes more pronounced as the datasets grow. Notably, FilmCPI can be transferred to unseen protein families using sequence-based data from other families, demonstrating its practicality. The effectiveness of FilmCPI is attributed to the different optimization speeds of its encoders, which sheds light on optimization imbalance in compound-protein interaction prediction models. Moreover, these advantages do not depend on increasing the number of parameters, suggesting that data imbalance can be used to make model design more lightweight.