GenoGraph: An Interpretable Graph Contrastive Learning Approach for Identifying Breast Cancer Risk Variants
Abstract
Genome-wide association studies (GWASs) have identified genetic risk variants associated with breast cancer. However, traditional GWAS methodologies, which typically analyze variants independently, one at a time, often fail to capture the complex genetic interactions underlying disease susceptibility. Machine learning and deep learning approaches are promising alternatives because they can model these interactions, yet they face two challenges: overfitting, caused by the vast number of genetic variants (~10 million) combined with limited sample sizes, and limited interpretability. Conventional statistical methods are computationally efficient and robust, but their ability to predict individual risk remains suboptimal. Deep learning methods, in contrast, can achieve higher predictive accuracy but are often prone to overfitting and instability. In this study, we propose GenoGraph, a graph-based contrastive learning approach that addresses high-dimensional, low-sample-size scenarios in genetic data analysis. GenoGraph uses a supervised learning strategy that exploits the relationships between individuals and their corresponding labels (cases vs. controls), guiding the network to improve its variant representation learning. In addition, a self-attention module is incorporated into the GenoGraph framework to make the learned representations interpretable. On an independent test set from the BioBank of Eastern Finland dataset, GenoGraph improves case-control classification accuracy by 36% compared with traditional GWAS techniques (χ²-test) and surpasses existing graph-based approaches, such as GRACE, by a 5.89% relative improvement in AUC-ROC. In the Finnish population, GenoGraph identified 2500 pivotal genetic variants, of which 370 are significantly associated with breast cancer at the GWAS significance threshold (p-value < 5 × 10⁻⁸).
These variants were further validated for biological relevance via a comprehensive bioinformatics pipeline. GenoGraph identified rs11672773 (p-value: 2.22 × 10⁻¹⁸, OR: 2.76 [2.19–3.46]) in ZNF8 as a key risk variant significantly associated with breast cancer in Finnish individuals. rs11672773 showed strong variant-variant interactions with rs10759243 (p-value: 1.59 × 10⁻²⁰, OR: 3.07 [2.42–3.89]) in KLF4 and with rs3803662 (p-value: 1.12 × 10⁻²², OR: 0.34 [0.27–0.42]) in TOX3, with interaction coefficients of 0.99 and 0.92, respectively. Both interacting variants have previously been implicated in breast cancer risk.
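
As a reading aid for the per-variant statistics quoted above, the following minimal sketch shows how an odds ratio, a confidence interval, and a χ²-test p-value can be computed for one variant from a 2×2 case-control table. It is not the authors' pipeline. The function name `allelic_association` and the counts in the usage line are hypothetical, and the sketch assumes a Wald 95% interval on the log odds ratio and a 1-degree-of-freedom χ² test without continuity correction. The abstract does not state which interval or test variant was used.

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Single-variant association from a 2x2 table of counts.

    Rows are cases/controls, columns are risk/reference allele counts.
    Returns (odds_ratio, (ci_low, ci_high), chi2, p_value).
    Assumes all four cells are non-zero.
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d

    # Odds ratio and Wald 95% confidence interval on the log scale.
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

    # Pearson chi-square statistic for a 2x2 table (no continuity correction).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

    # Survival function of a 1-df chi-square: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))

    return odds_ratio, (ci_low, ci_high), chi2, p_value


# Hypothetical counts, for illustration only (not data from the study).
odds_ratio, ci, chi2, p_value = allelic_association(30, 70, 15, 85)
print(f"OR={odds_ratio:.2f} CI=[{ci[0]:.2f}, {ci[1]:.2f}] chi2={chi2:.2f} p={p_value:.4f}")
```

A test of this kind scores each variant in isolation, which is the limitation the abstract attributes to traditional GWAS methodology. An odds ratio below 1, as reported for rs3803662, means the counted allele is less frequent among cases than among controls.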