CAN-TGI: Confidence-Aware Negative Sampling for Predicting TF-Target Gene Interaction via Heterogeneous Biological Networks Embedding

Thanh Tuoi Le
Xuan Tho Dang


Abstract

Identifying transcription factor (TF)-target gene interactions is essential for understanding gene regulatory networks and disease mechanisms. Recent advances in network embedding have enabled more effective extraction of both structural and semantic information from TF-target gene interaction networks. However, a major challenge persists: known (positive) interactions are vastly outnumbered by unlabeled TF-target gene pairs, many of which may represent true but undiscovered associations. Naively treating all unlabeled pairs as negatives introduces label noise, biases training, and limits model generalization. In this study, we introduce CAN-TGI, a confidence-aware negative sampling framework that integrates statistical filtering and biological semantics to select high-confidence negative samples. CAN-TGI employs a two-phase selection strategy: a One-Class SVM eliminates outliers, and a probabilistic refinement step excludes biologically plausible but unverified interactions. Combined with meta-path-based random walks and skip-gram embedding within a heterogeneous biological network, our method captures both structural and semantic contexts. Extensive experiments on benchmark datasets show that CAN-TGI achieves an average AUC of 0.9764 ± 0.0043 under five-fold cross-validation, significantly outperforming existing state-of-the-art methods. Furthermore, case studies on TFs such as TP53 and CEBPA validate the model's robustness in predicting novel regulatory associations, underscoring its potential for applications in regulatory genomics and precision medicine.
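To make the meta-path-based random walk step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy heterogeneous graph with only two node types ("TF" and "gene") and a symmetric meta-path TF-gene-TF. The node names (`TF1`, `G1`, ...), the edges, and the helper names `build_typed_adjacency` and `metapath_walk` are hypothetical and chosen purely for illustration.

```python
import random
from collections import defaultdict


def build_typed_adjacency(edges, node_type):
    """Index neighbours by (node, neighbour type) so typed lookups are cheap."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[(u, node_type[v])].append(v)
        adj[(v, node_type[u])].append(u)
    return adj


def metapath_walk(start, metapath, adj, node_type, walk_length, rng):
    """Random walk in which the i-th node has the type the meta-path dictates.

    The meta-path must start and end with the same type (e.g. TF-gene-TF)
    so that it can be repeated until walk_length nodes are collected.
    """
    if metapath[0] != metapath[-1]:
        raise ValueError("meta-path must be symmetric to be repeated")
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the meta-path head")
    period = len(metapath) - 1
    walk = [start]
    while len(walk) < walk_length:
        next_type = metapath[len(walk) % period]
        candidates = adj.get((walk[-1], next_type), [])
        if not candidates:  # dead end: stop the walk early
            break
        walk.append(rng.choice(candidates))
    return walk


# Hypothetical toy network; these are not real TF-target interactions.
node_type = {"TF1": "TF", "TF2": "TF", "G1": "gene", "G2": "gene", "G3": "gene"}
edges = [("TF1", "G1"), ("TF1", "G2"), ("TF2", "G2"), ("TF2", "G3")]

adj = build_typed_adjacency(edges, node_type)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
walks = [
    metapath_walk(tf, ["TF", "gene", "TF"], adj, node_type, 5, rng)
    for tf in ("TF1", "TF2")
    for _ in range(3)
]
```

Each walk is a type-constrained node sequence. In a pipeline of this kind, such sequences would typically be passed as "sentences" to a skip-gram model to learn node embeddings; how CAN-TGI configures that step is not stated in the abstract.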

Version published to 10.21203/rs.3.rs-7323024/v1 on Research Square
Aug 20, 2025

Related articles

Path-Probability Models Outperform Point-Estimate Scores for Noncoding GWAS Gene Prioritization
Abduxoliq Ashuraliyev. Latest version Dec 22, 2025.

Uncovering miRNA–Disease Associations Through Graph Based Neural Network Representations
Alessandro Orro. Latest version Jan 28, 2026.

Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods
Hayden Farquhar. Latest version Feb 4, 2026.
