Class-balanced negative training sets for improving classifier model predictions of enhancer-promoter interactions

Abstract

Background: Enhancers regulate gene expression by forming DNA loops that bring them into close proximity with their target gene promoters. The human genome contains hundreds of thousands of enhancers, vastly outnumbering its 20,000-25,000 protein-coding genes, which highlights the importance of enhancer-promoter interactions (EPIs) in gene regulation. Supervised learning models have been developed to predict EPIs, typically trained on experimentally validated interacting enhancer-promoter pairs together with artificially generated negative samples. However, the lack of reliable negative samples presents a challenge. Current methods randomly select pairs from unlabeled data, which leads to class imbalance and reduced predictive performance. This imbalance, in which individual enhancers and promoters are unevenly distributed between the positive and negative sets, hinders classifiers from learning meaningful patterns. Constructing more reliable negative samples is therefore crucial for improving the accuracy of EPI predictions.

Results: We developed two methods to generate class-balanced negative training sets for EPI classifiers: one based on maximum flow and the other on Gibbs sampling. We evaluated these methods with the TargetFinder and TransEPI classifiers across five and six cell lines, respectively. The trained models were tested using a common negative test set. Our negative training sets significantly improved prediction performance across several metrics, including precision, recall, and area under the receiver operating characteristic curve.

Conclusions: Our findings demonstrate that carefully designed negative samples can enhance the performance of EPI classifiers. More advanced methods for generating negative EPIs should improve prediction accuracy further. The source code is available at https://github.com/maruyama-lab-design/CBOEP2.
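To make the idea of a class-balanced negative set concrete, the following is a minimal sketch of how a maximum-flow formulation could select negative pairs so that each enhancer and promoter appears in the negative set no more often than in the positive set. It is not the authors' implementation (see the linked repository for that). The function name `balanced_negatives`, the network layout, and the toy identifiers are assumptions made for illustration only.

```python
from collections import defaultdict, deque


def balanced_negatives(positives, candidates):
    """Toy sketch (hypothetical helper, not the CBOEP2 code).

    Select negatives from `candidates` so that every enhancer and promoter
    occurs in the negatives at most as often as it occurs in `positives`.
    """
    pos = list(dict.fromkeys(positives))  # de-duplicate, keep order
    pos_set = set(pos)

    # Residual capacities of the flow network:
    #   source -> enhancer : number of positive pairs containing the enhancer
    #   enhancer -> promoter : 1 for each non-positive candidate pair
    #   promoter -> sink : number of positive pairs containing the promoter
    cap = defaultdict(lambda: defaultdict(int))
    for e, p in pos:
        cap["S"][("E", e)] += 1
        cap[("P", p)]["T"] += 1
    cand = [pair for pair in dict.fromkeys(candidates) if pair not in pos_set]
    for e, p in cand:
        cap[("E", e)][("P", p)] = 1

    # Edmonds-Karp: repeatedly push one unit along a shortest augmenting path.
    while True:
        parent = {"S": None}
        queue = deque(["S"])
        while queue and "T" not in parent:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if "T" not in parent:
            break  # no augmenting path left: the flow is maximum
        v = "T"
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= 1  # use one unit of forward capacity
            cap[v][u] += 1  # add one unit of residual (reverse) capacity
            v = u

    # A candidate edge with no remaining capacity carries flow: selected.
    return [(e, p) for e, p in cand if cap[("E", e)][("P", p)] == 0]


# Usage with made-up identifiers: three positive pairs, all nine pairs as candidates.
positives = [("e1", "p1"), ("e2", "p2"), ("e3", "p3")]
candidates = [(e, p) for e in ("e1", "e2", "e3") for p in ("p1", "p2", "p3")]
print(balanced_negatives(positives, candidates))
```

In this toy case each enhancer and promoter occurs once among the positives, so a maximum flow selects three negative pairs in which each occurs exactly once. Which of the valid selections is returned depends on the search order. With real data perfect balance may not be achievable, and the sketch then returns the largest selection that respects the per-element limits.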
