CAN-TGI: Confidence-Aware Negative Sampling for Predicting TF-Target Gene Interaction via Heterogeneous Biological Networks Embedding
Abstract
Identifying transcription factor (TF)-target gene interactions is essential for understanding gene regulatory networks and disease mechanisms. Recent advances in network embedding have enabled more effective extraction of both structural and semantic information from TF-target gene interaction networks. However, a major challenge persists: the severe imbalance between known (positive) interactions and the vast number of unlabeled TF-target gene pairs, many of which may represent true but undiscovered associations. Naively considering all unlabeled pairs as negatives introduces label noise, biases training, and limits model generalization. In this study, we introduce CAN-TGI, a Confidence-Aware negative sampling framework that integrates statistical filtering and biological semantics to select high-confidence negative samples. CAN-TGI employs a two-phase selection strategy, utilizing One-Class SVM for outlier elimination and probabilistic refinement to ensure the exclusion of biologically plausible but unverified interactions. Combined with meta-path-based random walks and skip-gram embedding within a heterogeneous biological network, our method effectively captures both structural and semantic contexts. Extensive experiments on benchmark datasets demonstrate that CAN-TGI achieves superior performance, with an average AUC of 0.9764 ± 0.0043 under five-fold cross-validation, significantly outperforming existing state-of-the-art methods. Furthermore, case studies on TFs such as TP53 and CEBPA validate the model's robustness in predicting novel regulatory associations, underscoring its potential for applications in regulatory genomics and precision medicine.
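To make the embedding step concrete, the following is a minimal sketch of a meta-path-guided random walk and the (center, context) pairs a skip-gram model would be trained on. It is not the authors' implementation: the toy network, the placeholder nodes `G1`, `G2`, and `D1`, the meta-path `TF -> gene -> TF`, and the function names `build_graph`, `metapath_walk`, and `skipgram_pairs` are all assumptions introduced for illustration. Only the TF names TP53 and CEBPA come from the abstract, and the edges attached to them here are invented, not reported interactions.

```python
import random
from collections import defaultdict


def build_graph(edges, node_types):
    """Build undirected adjacency lists, grouped by the neighbor's node type."""
    adj = defaultdict(lambda: defaultdict(list))
    for u, v in edges:
        adj[u][node_types[v]].append(v)
        adj[v][node_types[u]].append(u)
    return adj


def metapath_walk(adj, node_types, start, metapath, walk_length, rng):
    """Random walk whose node types follow a cyclic meta-path.

    `metapath` must begin and end with the same type (e.g. TF-gene-TF),
    so the pattern can repeat until `walk_length` nodes are collected.
    """
    if metapath[0] != metapath[-1]:
        raise ValueError("meta-path must start and end with the same node type")
    if node_types[start] != metapath[0]:
        raise ValueError("start node type does not match the meta-path")
    cycle = metapath[:-1]  # drop the repeated end type
    walk, pos = [start], 0
    while len(walk) < walk_length:
        pos = (pos + 1) % len(cycle)
        candidates = adj[walk[-1]][cycle[pos]]
        if not candidates:  # dead end: no neighbor of the required type
            break
        walk.append(rng.choice(candidates))
    return walk


def skipgram_pairs(walk, window):
    """Enumerate (center, context) pairs within `window` steps of each node."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


# Hypothetical toy network: two TFs, two placeholder genes, one placeholder disease.
node_types = {"TP53": "TF", "CEBPA": "TF", "G1": "gene", "G2": "gene", "D1": "disease"}
edges = [("TP53", "G1"), ("TP53", "G2"), ("CEBPA", "G2"), ("G1", "D1"), ("G2", "D1")]

adj = build_graph(edges, node_types)
rng = random.Random(0)  # fixed seed so the walk is reproducible
walk = metapath_walk(adj, node_types, "TP53", ["TF", "gene", "TF"], 5, rng)
pairs = skipgram_pairs(walk, window=1)
print(walk)
print(pairs)
```

In a full pipeline, many such walks per node and per meta-path would be collected and passed to a skip-gram trainer to obtain node embeddings. Constraining each step by node type is what lets the walks carry semantic context (which kinds of entities co-occur) in addition to plain graph proximity.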