A genotype-phenotype transformer to assess and explain polygenic risk

Abstract

Genome-wide association studies have linked millions of genetic variants to biomedical phenotypes, but their utility has been limited by lack of mechanistic understanding and widespread epistatic interactions. Recently, Transformer models have emerged as powerful machine learning architectures with potential to address these and other challenges. Here we introduce the Genotype-to-Phenotype Transformer (G2PT), a framework for modeling hierarchical information flow among variants, genes, multigenic systems, and phenotypes. As proof-of-concept, we train G2PT to model the genetics of metabolic traits including insulin resistance (serum triglycerides-to-HDL ratio), LDL and type-2 diabetes. G2PT predicts these traits with accuracy exceeding state-of-the-art and, unlike other polygenic models, extends to distinct populations not used for training. Predictions of insulin resistance are based on >1,395 variants within 20 systems and include epistatic interactions among variants, e.g. between APOA4 and CETP in phospholipid transfer. This work positions hierarchical graph transformers as a next-generation approach to polygenic risk.

Article activity feed

  1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

    Learn more at Review Commons


    Reply to the reviewers

    Response to Reviewer 1:

    The authors introduce G2PT, a hierarchical graph transformer model that integrates genetic variants (SNPs), gene annotations, and multigenic systems (Gene Ontology) to predict and interpret complex traits.

    We thank the reviewer for this accurate summary of our approach and contributions.

    Major Comments:

    Comment 1-1. Insufficient Specification of Model Architecture: The description of the "hierarchical graph transformer" lacks technical depth. Key implementation details are missing: how node embeddings are initialized for SNPs, genes, and systems; how graph connectivity is defined at each level (e.g., adjacency matrices used in Equations 5-9, the sparsity); justification for the choice of embedding dimension and number of attention heads, including any sensitivity analysis; and the architecture of the feed-forward neural networks (e.g., number of layers, activation functions, and hidden dimensions).

    Reply 1-1. As requested, we have expanded the technical description of the model architecture, including the hierarchical graph transformer (HiGT), in the Materials and Methods section. Details regarding node initialization and hierarchical connectivity are now included in the new paragraph "Model Initialization and Graph Construction." Specifically, all node embeddings corresponding to SNPs, genes, and ontology-defined systems are initialized using uniform Xavier initialization (Glorot and Bengio, 2010).
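    For readers unfamiliar with the scheme, the following is a minimal sketch of uniform Xavier (Glorot) initialization for a table of node embeddings. It is not taken from the G2PT code base: the function name, the node count of 1,000, and the use of plain Python lists are assumptions made for illustration; only the hidden dimension of 64 comes from the response itself.

```python
import math
import random

def xavier_uniform(n_rows, n_cols, rng):
    """Draw an n_rows x n_cols matrix from U(-a, a) with a = sqrt(6 / (n_rows + n_cols))."""
    bound = math.sqrt(6.0 / (n_rows + n_cols))
    return [[rng.uniform(-bound, bound) for _ in range(n_cols)] for _ in range(n_rows)]

rng = random.Random(0)
n_nodes = 1000      # hypothetical number of SNP, gene, or system nodes
hidden_dim = 64     # embedding size reported in Reply 1-5
embeddings = xavier_uniform(n_nodes, hidden_dim, rng)

# The bound depends only on the sum of the two dimensions,
# so it is the same whichever one is treated as fan-in.
bound = math.sqrt(6.0 / (n_nodes + hidden_dim))
```

    The bound shrinks as the table grows, which keeps the variance of the initial embeddings on a comparable scale across node types of different sizes.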

    We have also clarified our hyperparameter optimization strategy. Learning rate, weight decay, hidden (embedding) dimension, and the number of attention heads were selected via grid search, as summarized in new Supplementary Fig. 8, reproduced below. Based on both performance and computational efficiency, we adopted four attention heads, a configuration commonly used in academic transformer models (for reference, the original Transformer of Vaswani et al., 2017, used eight).

    Regarding the feed-forward neural network, we follow the standard Transformer architecture consisting of two position-wise layers with a hidden dimension four times larger than the node embedding size and a GELU nonlinear activation function (Hendrycks and Gimpel, 2016). This configuration is widely established in the literature and functions as an intermediate processing step following attention; therefore, we did not include it in hyperparameter tuning. All corresponding updates have been incorporated into the revised Methods section for clarity and completeness.
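    The feed-forward block described above can be sketched as follows. This is an illustrative stand-alone version in plain Python, assuming the exact (erf-based) form of GELU and randomly drawn placeholder weights; the helper names are hypothetical and the code is not the authors' implementation.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def feed_forward(vec, w1, b1, w2, b2):
    """Two position-wise layers: expand, apply GELU, project back."""
    hidden = [gelu(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

rng = random.Random(0)
d_model = 64          # node embedding size reported in Reply 1-5
d_ff = 4 * d_model    # fourfold expansion of the hidden layer

def rand_matrix(rows, cols):
    # Placeholder weights for illustration only (not trained values)
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

w1, b1 = rand_matrix(d_ff, d_model), [0.0] * d_ff
w2, b2 = rand_matrix(d_model, d_ff), [0.0] * d_model
node_embedding = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
output = feed_forward(node_embedding, w1, b1, w2, b2)
```

    The output has the same dimension as the input embedding, so the block can be applied independently to every node after attention.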

    Comment 1-2. No Simulation Studies to Validate Epistasis Detection: The ground truth epistasis interaction should use the ones that have been manually validated by literature. The central claim of discovering epistatic interactions relies heavily on the model's attention mechanism and downstream statistical filtering. However, no simulation studies are presented to validate that G2PT can reliably detect epistasis when ground-truth interactions are known. Demonstrating robust detection of non-additive interactions under varying genetic architectures and noise levels in simulated genotype-phenotype datasets is essential to substantiate the method's core capability.

    Reply 1-2. We agree that a simulation of epistasis detection using the G2PT model is a worthy addition to the manuscript. Accordingly, we have now incorporated a new section in the Results titled "Validation of Epistasis through Simulation Studies", which includes two new figures reproduced below (Supplementary Fig. 6 and Fig. 5). We have also added a new Methods section to describe this simulation study under the heading "Epistasis Simulation". These simulation studies show that G2PT recovers epistatic gene pairs with high fidelity when these pairs are coherent with the systems ontology (cf. 'ontology coherence' in Supplementary Fig. 6, which reflects the probability that both SNPs are assigned to the same leaf system). Furthermore, G2PT outperforms previous tools, such as PLINK-epistasis, which do not use knowledge of the systems hierarchy in the same way (Supplementary Fig. 6b-d). Using simulation parameters consistent with current genome-wide association studies (n = 400,000) and current understanding of heritability (h² = 0.3 to 0.5) (Bloom et al. 2015; Speed and Evans 2023), we find that approximately 10% of all epistatic SNP pairs can be recovered at a precision of 50% (Fig. 5). We have provided the source code for this simulation study in our GitHub repository (https://github.com/idekerlab/G2PT/blob/master/Epistasis_simulation.ipynb).
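    To make the simulation setup concrete, here is a deliberately small sketch of how a phenotype with one epistatic SNP pair and a target heritability can be generated. It is not the authors' notebook: the two-SNP design, the effect sizes, the allele frequency, and the sample size of 20,000 are all assumptions chosen so the example runs quickly, and the heritability here is broad-sense (it includes the interaction term).

```python
import random
import statistics

def simulate(n, maf, h2, beta_add, beta_epi, rng):
    """Simulate two SNPs with additive effects plus one pairwise interaction."""
    # Genotypes are allele counts (0, 1, or 2) under Hardy-Weinberg proportions
    g1 = [sum(rng.random() < maf for _ in range(2)) for _ in range(n)]
    g2 = [sum(rng.random() < maf for _ in range(2)) for _ in range(n)]
    genetic = [beta_add * (a + b) + beta_epi * a * b for a, b in zip(g1, g2)]
    var_g = statistics.pvariance(genetic)
    # Pick the noise SD so that var_g / (var_g + var_e) equals the target h2
    sd_e = (var_g * (1.0 - h2) / h2) ** 0.5
    y = [g + rng.gauss(0.0, sd_e) for g in genetic]
    return g1, g2, genetic, y

rng = random.Random(1)
g1, g2, genetic, y = simulate(n=20000, maf=0.3, h2=0.3,
                              beta_add=0.1, beta_epi=0.2, rng=rng)

# Fraction of phenotypic variance explained by the genetic component
realized_h2 = statistics.pvariance(genetic) / statistics.pvariance(y)
```

    With a known interaction term planted in the data, a detection method can then be scored by whether the planted pair ranks among its top candidates.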

    Comment 1-3. Lack of Justification for Model Complexity and Missing Ablation Insights: While Supplementary Figure 2 presents ablation studies, the manuscript needs to justify the high computational cost (168 GPU hours using 4×A30 GPUs) of the full model. It remains unclear how much performance gain is specifically due to reverse propagation (Equations 8-9), which is claimed to capture biological context. The benefit of using a full Gene Ontology hierarchy versus a flat system list is not quantified. There is also no comparison between bidirectional versus unidirectional propagation. Overall, the added complexity is not empirically shown to be necessary.

    Reply 1-3. We thank the reviewer for prompting a clearer justification of complexity and ablations. We have now revised the Results to (i) quantify the specific value of the ontology and reverse propagation, and (ii) explain why a flat SNP→system model is computationally and biologically sub-optimal. We have added new ablation results to compare bidirectional (forward+reverse) versus forward-only propagation. Reverse propagation has little effect when epistatic pairs are within one system (ontology coherence ρ=1.0) but substantially improves retrieval when interactions span related systems (e.g., ρ≈0.8) (figure reproduced below). A flat design scores a dense genes×systems map, ignoring known sparsity (sparse SNP→gene assignments; sparse ontology edges) and losing multi-scale context. Our hierarchical formulation instead restricts computation to observed edges (SNP→gene→system) and aggregates signals across levels, yielding better efficiency and biological fidelity.
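    The idea of restricting attention to observed edges can be illustrated with a masked softmax. The sketch below is a toy example under stated assumptions: the SNP and gene names, the edge lists, and the raw scores are invented, and the function is a generic masking routine rather than the G2PT implementation.

```python
import math

# Hypothetical toy annotation: which genes each SNP maps to
snp_to_gene = {"snp1": ["geneA"], "snp2": ["geneA", "geneB"], "snp3": ["geneC"]}

def masked_softmax(scores, allowed):
    """Softmax over allowed keys only; every other key receives weight 0."""
    kept = {k: s for k, s in scores.items() if k in allowed}
    peak = max(kept.values())  # subtract the max for numerical stability
    exps = {k: math.exp(s - peak) for k, s in kept.items()}
    total = sum(exps.values())
    return {k: exps.get(k, 0.0) / total for k in scores}

# Made-up raw attention scores from geneA toward every SNP
scores = {"snp1": 2.0, "snp2": 1.0, "snp3": 5.0}

# Only SNPs annotated to geneA may contribute to its embedding
allowed = {snp for snp, genes in snp_to_gene.items() if "geneA" in genes}
weights = masked_softmax(scores, allowed)
```

    Although snp3 has the largest raw score, it is not annotated to geneA and therefore receives zero weight; the remaining weight is shared between snp1 and snp2.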

    Comment 1-4. Non-Equivalent Benchmarking Against PRS Methods: Figure 2 compares G2PT to polygenic risk score (PRS) methods such as LDpred2 and Lassosum, but G2PT is run only on SNPs pre-filtered by marginal association (p-values between 10⁻⁵ and 10⁻⁸), while the PRS methods use genome-wide SNPs. This introduces a strong bias in G2PT's favor by effectively removing noise. A fair comparison would require: (a) running LDpred2 and Lassosum on the same pre-filtered SNP sets as G2PT, or (b) running G2PT on genome-wide or LD-pruned SNP sets. The reported superior performance of G2PT may be driven primarily by this input filtering, not the model architecture.

    Reply 1-4. We appreciate the reviewer's concern regarding benchmarking equivalence. In response, we have extended our analyses to include PRS-CS (Ge et al., 2019) and SBayesRC (Zheng et al., 2024), two state-of-the-art Bayesian shrinkage methods comparable to LDpred2 and Lassosum. Although we initially attempted to run LDpred2 and Lassosum under all SNP-filtering conditions, their computational requirements at UK Biobank scale proved prohibitively time-consuming. We therefore focused on PRS-CS and SBayesRC, which offer similar modeling principles with greater computational tractability. These methods have now been run under SNP-filtering conditions matched to those of our original study. The new results demonstrate that G2PT consistently outperforms PRS-CS and SBayesRC (new Fig. 2, reproduced below), indicating that its performance advantage is not solely attributable to SNP pre-filtering but also to its hierarchical attention-based architecture.

    Comment 1-5. No Details on Hyperparameter Optimization: Although the manuscript mentions grid search for hyperparameter tuning, it provides no information about which parameters were optimized (e.g., learning rate, dropout rate, weight decay, attention dropout, FFNN dimensions), what search space was explored, or what final values were selected. There is also no assessment of how sensitive the model's performance is to these choices. Better transparency would help facilitate reproducibility.

    Reply 1-5. We agree with the reviewer and have expanded the manuscript to include full details of hyperparameter optimization. As described in the revised Methods section, we performed a grid search over learning rate {10⁻³, 10⁻⁴, 10⁻⁵}, hidden dimension {64, 128}, and weight decay {0, 10⁻⁵, 10⁻³}. The results, summarized in Supplementary Fig. 8 (reproduced above), show that model performance is most sensitive to the learning rate, while hidden dimension and weight decay exert more moderate effects. Based on these findings, we selected a learning rate of 10⁻⁵, hidden dimension of 64, and weight decay of 10⁻³ for all subsequent experiments. Although a hidden dimension of 128 slightly improved performance, we adopted 64 to balance predictive accuracy with computational efficiency.
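    The search space listed above can be enumerated as in the sketch below. The grid values and the selected configuration are taken from the response; the helper `best_config` and its `evaluate` callback are hypothetical stand-ins for whatever train-and-validate routine is actually used.

```python
from itertools import product

# Search space as listed in Reply 1-5
grid = {
    "learning_rate": [1e-3, 1e-4, 1e-5],
    "hidden_dim": [64, 128],
    "weight_decay": [0.0, 1e-5, 1e-3],
}

# Cartesian product of all settings: 3 * 2 * 3 combinations
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

def best_config(configs, evaluate):
    """Return the configuration with the highest score.

    `evaluate` is a user-supplied callable (hypothetical here) that trains
    a model with the given settings and returns a validation metric.
    """
    return max(configs, key=evaluate)

# Configuration the authors report selecting
selected = {"learning_rate": 1e-5, "hidden_dim": 64, "weight_decay": 1e-3}
```

    Because every combination is trained once, the cost of the search grows multiplicatively with each added hyperparameter, which is one reason to keep the grid small.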

    Comment 1-6. Absence of Control for Key Confounders: In interpreting attention scores as reflecting genetic relevance (e.g., the role of the immunoglobulin system), the model includes only age, sex, and genetic principal components as covariates. Important confounders such as BMI, alcohol use, or medication (e.g., statins) have not been controlled for. Since TG/HDL levels are strongly influenced by environment and lifestyle, it is entirely plausible that some high-attention features reflect environmental tagging, not biological causality.

    Reply 1-6. In the current framework, we included age, sex, and genetic principal components to account for demographic and population-structure effects, focusing on genetic contributions within a controlled baseline. We acknowledge that non-genetic covariates can influence downstream biological states and may indirectly shape attention at the gene or system level. Accurately modeling such effects requires an extended framework where environmental variables directly modulate gene and system embeddings rather than being implicitly absorbed by the attention mechanism. We have clarified these limitations in the Discussion along with plans to incorporate explicit confounder modeling in future extensions of G2PT.

    Comment 1-7. Oversimplified Treatment of SNP-to-Gene Mapping: The SNP-to-gene mapping strategy combines cS2G, eQTL, and nearest-gene annotations, but the limitations of this approach are not adequately addressed. The manuscript does not specify how conflicts between methods are resolved or what fraction of SNPs map ambiguously to multiple genes. Supplementary Figure 2 shows model performance degrades when using only nearest-gene mapping, but there is no systematic analysis of how mapping uncertainties propagate through the hierarchy and affect attention or interpretation.

    Reply 1-7. In the revision (Results), we have clarified how conflicts between cS2G, eQTL, and nearest-gene annotations are resolved, and we have reported the proportion of SNPs that map to multiple genes across these three annotation approaches. We note that the hierarchical attention mechanism enables the model to prioritize among alternative gene mappings in a data-driven manner, and this is a major strength of the approach. As shown in Fig. 3 (Results, reproduced below), SNP-to-gene attention weights reveal dominant linkages, reducing the impact of mapping uncertainty on interpretation. We now explicitly describe this mechanism and acknowledge that further work in probabilistic mapping and fine-mapping approaches is a valuable future direction for improving resolution and interpretability.

    "For SNPs with several potential SNP-to-gene mappings (Methods), we found that G2PT often prioritized one of these genes in particular due to its membership in a high-attention system. For example, the chr11q23.3 locus contains multiple genes including the APOA1/C3/A4/A5 gene cluster (Fig. 3c) which is well-known to govern lipid transport, an important system for G2PT predictions (Fig. 3a). Due to high linkage disequilibrium in the region, all of its associated SNPs had multiple alternative gene mappings available. For example, SNP rs1145189 mapped not only to APOA5 but to the more proximal BUD13, a gene functioning in spliceosomal assembly (a system receiving substantially lower G2PT attention). Here, the relevant information flow learned by G2PT was from rs1145189 to APOA5 to lipid transport and protein-lipid complex remodeling (Fig. 3c; and conversely, deprioritizing BUD13 as an effector gene for TG/HDL). We found that this particular genetic flow was corroborated by exome sequencing, which implicates APOA5 but not BUD13 in regulation of TG/HDL, using data that were not available to G2PT. Similarly, two other SNPs at this locus - rs518547 and rs11216169 - had potential mappings to their closest gene SIK3, where they reside within an intron, but also to regulatory elements for the more distant lipid transport genes APOC3 and APOA4. Here, G2PT preferentially weighted the mappings to APOC3 and APOA4 rather than to SIK3 (Fig. 3c)."

    Comment 1-8. Naive Scoring of System Importance: The method used to quantify the biological relevance of systems (i.e., correlating attention scores with predicted phenotype values) risks circular reasoning. Since the model is trained to optimize prediction, systems that contribute strongly to prediction will naturally show high correlation, even if they are not biologically causal. No comparison is made with established gene set enrichment methods applied to GWAS summary statistics. The approach lacks an independent benchmark to validate that the "important" systems are biologically meaningful.

    Reply 1-8. As requested, we compared G2PT's system-level importance scores with results from MAGMA competitive gene-set analysis, an established enrichment approach. This analysis indeed shows significant correlation between the systems identified by the two approaches (ρ = 0.26, p < .01; Supplementary Table 2), reflecting a shared emphasis on canonical lipid processes. We also observed systems detected by G2PT but not strongly detected by MAGMA's linear enrichment model, for example the lipopolysaccharide-mediated signaling pathway (Kalita et al. 2022).
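    A rank correlation of this kind can be computed as sketched below. The response does not state which correlation coefficient was used, so the choice of Spearman's ρ is an assumption, and the five score pairs are invented toy values rather than results from the study.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy values for five systems (illustrative only, not study data)
attention_importance = [0.9, 0.1, 0.5, 0.7, 0.3]
enrichment_score = [3.0, 0.5, 2.5, 1.0, 2.0]
rho = spearman(attention_importance, enrichment_score)
```

    Working on ranks rather than raw values makes the comparison insensitive to the very different scales of attention-derived scores and enrichment statistics.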

    Comment 1-9. No External Validation to Assess Generalizability. All evaluations are performed using cross-validation within the UK Biobank. There is no assessment of generalizability to independent cohorts or diverse ancestries. Given population structure, genotyping platform, and phenotype measurement variability, external validation is essential before claiming the method is suitable for broader use in polygenic risk assessment.

    Reply 1-9. External validation of the G2PT model requires individual-level genotype data with paired TG/HDL measurements, a sample size at the scale of the UK Biobank, and GPU access to these data. We therefore approached the All of Us (AoU) program, a large and diverse cohort with individual-level data, including T2D conditions with HbA1c measurements. We first processed the All of Us genotype and phenotype data as we had processed UKBB data (Methods), resulting in 41,849 participants with T2D and 80,491 without T2D across various ethnicities. We then transferred the trained T2D G2PT model to the AoU Workbench and evaluated its performance. The model demonstrated robust discriminative capability with an explained variance of 0.025, as shown in the new Fig. 2d (reproduced above).

    Comment 1-10. Computational Burden and Scalability Are Not Addressed: The paper notes that training the model requires 168 GPU hours on 4×A30 GPUs for just ~5,000 SNPs. However, there is no discussion of whether G2PT can scale to larger SNP sets (e.g., genome-wide imputed data) or more complex biological hierarchies (e.g., Reactome pathways). Without addressing scalability, the model's applicability to real-world, large-scale genomic datasets remains unclear.

    Reply 1-10. We have addressed scalability with both engineering optimizations and new scalability experiments. First, we refactored the model to use the xFormers memory-efficient attention for the hierarchical graph transformer (Lefaudeux et al., 2022), which also allows training to be fully parallelized, reducing bottlenecks. Second, we added a scaling study with progressively increasing SNP count. On 4×A30 GPUs, end-to-end training time for the 5k-SNP setting decreased from 4000 to 400 min (approximately 7 GPU-hours, a tenfold reduction). These new results are given in Supplementary Fig. 7, reproduced below.

    Minor Comment:

    Comment 1-11. Attention Weights as Mechanistic Insight: The paper equates high attention scores with biological importance, for example in highlighting the immunoglobulin system. There is no causal validation showing that altering the highlighted SNPs, genes, or systems has an actual effect on TG/HDL. Attention weights in transformer models are known to sometimes reflect spurious correlations, especially in high-dimensional settings. The correlation between attention scores and predictions (Supplementary Fig. 3a,b) does not constitute biological evidence. The interpretability claims should be restated in the absence of supporting functional or causal validation.

    Reply 1-11. We thank the reviewer for this thoughtful comment. We agree that attention weights are not causal evidence. In the revision, we (1) reframe attention-based findings as hypothesis-generating rather than mechanistic, and (2) add an explicit limitation noting that correlations between attention scores and predictions do not constitute biological validation.

    Response to Reviewer 2:

    This manuscript describes the introduction of the Genotype-to-Phenotype Transformer (G2PT), described by the authors as "a framework for modeling hierarchical information flow among variants, genes, multigenic systems, and phenotypes." The authors used the ratio TG/HDL as a trait for proof of concept of this tool.

    This is a potentially interesting computational tool of interest to bioinformaticians, computational genomicists, and biologists.

    We thank the reviewer for their overall positive assessment of our study.

    Comment 2-1. The rationale for choosing the TG/HDL ratio for this proof of concept analysis is not well justified beyond it being a marker for insulin resistance. Overall the use of a ratio may be problematic (see below). Analyses of TG and HDL separately as individual quantitative traits would be of interest. And an analysis of a dichotomous clinical trait (T2DM or CAD) would also be of great interest.

    Reply 2-1. We thank the reviewer for this suggestion. In the revised manuscript, we have expanded our analyses beyond the TG/HDL ratio to include TG and HDL as individual quantitative traits (Fig. 2, reproduced below). These additional analyses demonstrate that G2PT captures predictive signals robustly across each lipid component, not solely through their ratio. Furthermore, to address the reviewer's interest in clinical outcomes, we incorporated an analysis of type 2 diabetes (T2D) as a dichotomous trait of direct clinical relevance. Collectively, these results strengthen the rationale for our chosen phenotype and show that the G2PT framework generalizes effectively across quantitative and binary traits, consistently outperforming advanced PRS and machine learning benchmarks.

    Comment 2-2. The approach to mapping SNPs to genes does not incorporate the most advanced approaches. This should be described in more detail.

    Reply 2-2. We agree that the choice of SNP-to-gene mapping materially affects both performance and interpretability; indeed, our epistasis simulations suggest that more accurate mappings can improve recovery and localization. In this proof-of-concept work we use a straightforward, modular mapping sufficient to demonstrate the modeling framework, and we have clarified this in the Methods. The architecture is designed so that alternative SNP-to-gene maps (e.g., eQTL/colocalization-based assignments, promoter-capture Hi-C) can be swapped in directly. A dedicated follow-up study will systematically compare these alternatives and quantify their impact on attribution and downstream discovery.

    Comment 2-3. The example of gene prioritization at the A1/C3/A4/A5 gene locus is not particularly illuminating, as the prioritized genes are already well-known to influence TG and HDL-C levels and the TG/HDL ratio. Can the authors provide an example where G2PT prioritized a gene at a locus that is not already a well-known regulator of TG and HDL metabolism?

    Reply 2-3. We thank the reviewer for this suggestion. We have revised the manuscript to de-emphasize the well-established APOA1 locus and instead highlight the less expected "Positive regulation of immunoglobulin production" system (Fig. 3a,b; Discussion). Here our model prioritizes the gene TNFSF13 based on specific variants that were not previously associated with TG or HDL (e.g., rs5030405, rs1858406, shown in blue). This finding points to an intriguing, non-canonical link between B-cell regulation and lipid metabolism. While full exploration of this finding is beyond the scope of the present methods paper, this example demonstrates G2PT's ability to identify novel, high-priority candidates in atypical systems.

    Comment 2-4. The identification of epistatic interactions is a potentially interesting application of G2PT. However, suppl table 1 shows a very limited number of such interactions with even fewer genes, and most of these are well established biological interactions (such as LPL/apoA5). The TGFB1 and FKBP1A interaction is interesting and should be discussed. What is needed for increasing the number of potential interactions, greater power?

    Reply 2-4. We are glad the reviewer appreciates the use of the G2PT model to identify epistatic interactions. We have now discussed a potential mechanism of epistasis between TGFB1 and FKBP1A in the protein dephosphorylation system (Discussion). In addition, we have addressed the reviewer's question about statistical power through extensive epistasis simulations (Fig. 5 and Supplementary Fig. 6), which show that G2PT's detection ability scales strongly with sample size: 1,000 samples are insufficient, performance improves at 5,000, and power becomes reliable at 100,000. Realistic simulations (Fig. 5b-d) further demonstrate that under biologically plausible architectures, G2PT can robustly recover specific interactions even within complex genetic backgrounds.

    Comment 2-5. Furthermore, the use of the TG/HDL ratio for the assessment of epistatic interactions may be problematic. For example, if one SNP affected only TG and the other only HDL-C, it would appear to be an epistatic interaction with regard to the ratio, although the biological epistasis may be limited to non-existent.

    Reply 2-5. We have greatly expanded the example phenotypes modeled in our study. Please see our Reply 2-1 above.
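    The concern raised in Comment 2-5 can be checked with a short calculation. In the sketch below, one SNP shifts only TG and the other shifts only HDL, each additively; the baseline levels and effect sizes are hypothetical numbers chosen for illustration. The interaction contrast on a 2×2 genotype grid is nonzero for the raw ratio but vanishes on the log scale, where the ratio separates into a TG term minus an HDL term.

```python
import math

def ratio(g1, g2, mu_tg=1.5, mu_hdl=1.3, a=0.2, b=0.1):
    """TG/HDL when SNP 1 shifts only TG and SNP 2 shifts only HDL.

    All default values are hypothetical and serve only as an example.
    """
    return (mu_tg + a * g1) / (mu_hdl + b * g2)

def interaction_contrast(f):
    """Departure from additivity: f(1,1) - f(1,0) - f(0,1) + f(0,0)."""
    return f(1, 1) - f(1, 0) - f(0, 1) + f(0, 0)

# Nonzero on the ratio scale: equals -a*b / (mu_hdl * (mu_hdl + b))
raw = interaction_contrast(ratio)

# Zero (up to rounding) on the log scale, where log(TG) - log(HDL) is separable
logged = interaction_contrast(lambda g1, g2: math.log(ratio(g1, g2)))
```

    This is why analysing TG and HDL as separate traits, or the log-transformed ratio, helps distinguish scale-induced interaction from biological epistasis.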

    Response to Reviewer 3:

    This manuscript by Lee et al provides a sensible and powerful approach to polygenic score prediction. The model aggregates information from SNPs to genes to systems, using a transformer based architecture, which appears to increase predictive performance, produce interpretable outputs of genes and systems that underlie risk, and identify candidates for epistasis tests.

    I think the manuscript is clear and well written, and conducted via state-of-the-art approaches. I don't have any concerns regarding the claims that are made.

    We thank the reviewer for their very positive assessment of our study.

    Major comments:

    Comment 3-1. Specifically, lipid based traits are perhaps the most well-powered and the most biologically coherent; they are also very well-studied biologically and thus overrepresented in the gene ontology. It is unclear whether this approach will work as well for a trait like Schizophrenia for which the underlying pathways are not as well captured in existing ontologies. The authors anticipate this in their limitations section, and I am not expecting them to solve every issue with this, but it would be nice to expand the testing a little bit beyond only this one trait.

    Reply 3-1. We appreciate the reviewer's suggestion to expand beyond a single lipid trait. In the revised manuscript, we have included analyses of additional phenotypes, including low-density lipoprotein (LDL) and T2D (Fig. 2). These additions demonstrate the broader applicability of our framework beyond a single trait class.

    Comment 3-2. It also seems like the authors have not compared their method to the truly latest PRS methods, such as PRS-CSx and SBayesR. I would suggest adding some of the methods shown to be the best from this recent paper: https://www.nature.com/articles/s41598-025-02903-1

    Reply 3-2. We agree these are important comparators. Accordingly, we have extended our comparison to include PRS-CS (Ge et al., 2019) and SBayesRC (Zheng et al., 2024), following their strong performance in recent benchmarking studies (see Fig. 2 above). We confirmed that G2PT outperforms these advanced PRS methods for each of the TG/HDL ratio, LDL, and T2D phenotypes.

    Comment 3-3. Another major comment regards whether this method could be applied to traits with just GWAS summary statistics, rather than individual level data. This would not enable identification of specific methods underlying an individual, but it could still learn SNP based weights that could be mapped to genes and systems that could help explain risk when the model is applied to individuals (kind of like a pretraining step?)

    Reply 3-3. We appreciate this suggestion. While SNP weights from GWAS summary statistics could, in principle, serve as informative priors for attention values, incorporating them would require a sophisticated mathematical formulation that is beyond the scope of this study. Our current framework also relies on individual-level genotype and phenotype data to capture multilevel information flow and individual-specific variation.

    Minor comments:

    Comment 3-4. Why the need to constrain to a small number of SNPs? Is it just computational cost? If so, what would happen as power increases and more SNPs exceed the thresholds used?

    Reply 3-4. Yes, it's about computational cost, but we've now modified the code for improved computational efficiency. First, we refactored the model to use the xFormers memory-efficient attention for the hierarchical graph transformer (Lefaudeux et al., 2022), which also allows training to be fully parallelized, reducing bottleneck effects. Second, we added a scaling study of the impact of varying SNP count. On 4×A30 GPUs, end-to-end training time for the 5k-SNP setting decreased from 65 hours to 7 GPU-hours (×9). Based on Fig. 2 (reproduced above), we expect that performance may increase if more SNPs are provided to the model. With the optimized implementation, users can raise SNP thresholds as power increases; the expected behavior is improved accuracy up to a plateau, while hierarchical sparsity keeps training tractable and results well regularized.

    Comment 3-5. What type of sample size/power does this method require to work well? If others were to use it, how many SNPs/samples would be needed to obtain good performance?

    Reply 3-5. To address this comment, we quantified performance as a function of training size by subsampling the cohort and retraining G2PT with identical architecture and SNP set. New Supplementary Fig. 3 (reproduced below) shows monotonic gains with sample size across three representative phenotypes. We found that stable performance is reached by ~100k samples. These trends hold for continuous traits (TG/HDL, LDL) and more modestly for a binary trait (T2D), consistent with lower per-sample information for case-control settings.
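    One simple way to set up such a sample-size study is sketched below. The nested design (each smaller training set contained in every larger one) is an assumption made for illustration, as are the function name, the seed, and the specific sizes; the response does not say how the subsamples were drawn.

```python
import random

def nested_subsamples(sample_ids, sizes, seed=0):
    """Shuffle once, then take prefixes so smaller sets nest inside larger ones."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    # Sizes larger than the cohort are skipped rather than padded
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}

# Hypothetical cohort of 100,000 participant indices
subsets = nested_subsamples(range(100_000), sizes=[1_000, 5_000, 100_000])
```

    Nesting the subsets means that differences between points on the learning curve reflect added samples rather than a different random draw of participants.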

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

    Referee #3

    Evidence, reproducibility and clarity

    This manuscript by Lee et al provides a sensible and powerful approach to polygenic score prediction. The model aggregates information from SNPs to genes to systems, using a transformer based architecture, which appears to increase predictive performance, produce interpretable outputs of genes and systems that underlie risk, and identify candidates for epistasis tests.

    I think the manuscript is clear and well written, and conducted via state-of-the-art approaches. I don't have any concerns regarding the claims that are made.

    My two major comments regard a question about how well this will work when compared to other approaches for other traits besides TG:HDL. Specifically, lipid based traits are perhaps the most well-powered and the most biologically coherent; they are also very well-studied biologically and thus overrepresented in the gene ontology. It is unclear whether this approach will work as well for a trait like Schizophrenia for which the underlying pathways are not as well captured in existing ontologies. The authors anticipate this in their limitations section, and I am not expecting them to solve every issue with this, but it would be nice to expand the testing a little bit beyond only this one trait.

    Therefore, I would suggest that the authors test a limited number of additional traits that are not lipid based traits, and ideally not metabolic traits, to see how their model behaves. I would pick well-powered GWAS with a lot of associations but from a different phenotypic category.

    It also seems like the authors have not compared their method to the truly latest PRS methods, such as PRS-CSx and SBayesR. I would suggest adding some of the methods shown to be the best from this recent paper: https://www.nature.com/articles/s41598-025-02903-1

    Another major comment regards whether this method could be applied to traits with just GWAS summary statistics, rather than individual level data. This would not enable identification of specific methods underlying an individual, but it could still learn SNP based weights that could be mapped to genes and systems that could help explain risk when the model is applied to individuals (kind of like a pretraining step?)

    Other minor comments:

    Why the need to constrain to a small number of SNPs? Is it just computational cost? If so, what would happen as power increases and more SNPs exceed the thresholds used?

    What type of sample size/power does this method require to work well? If others were to use it, how many SNPs/samples would be needed to obtain good performance?

    Will this work just as well for binary diseases? Is this a straightforward extension of the method or does it require more work?

    Since I think a lot of geneticists will read it, more intuition as to how attention weights map to parameters geneticists think about would be useful, in particular how the graphics in Fig 3 are made (this may be second nature to ML experts but may not be obvious to statistical geneticists).

    The authors claim that G2PT identifies epistatic interactions. Is this true or does it just identify pairs of SNPs that could be subsequently tested for epistasis?
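
    For readers less familiar with the distinction raised here, statistical epistasis between two SNPs is conventionally assessed as a departure from additivity in a regression model. The notation below is illustrative and is not taken from the manuscript: $y$ is the phenotype, $g_1$ and $g_2$ are allele counts at the two SNPs, and $c$ is a vector of covariates.

```latex
% Illustrative notation only (not the manuscript's): pairwise interaction model
y = \beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_{12}\, g_1 g_2
    + \gamma^{\top} c + \varepsilon,
\qquad H_0 : \beta_{12} = 0 .
```

    Under this framing, a pair of SNPs highlighted by attention is a candidate, and it counts as an epistatic interaction only if the test of $\beta_{12}$ rejects the null after correction for the number of pairs tested.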

    Significance

    This study does a great job of marrying the latest (interesting) technologies in AI/ML with a specific problem in statistical genetics. The clarity of presentation and interpretability of the model are strong. The main areas for improvement are to clarify how general this approach is: will it work for other traits, is it truly better than the latest PRS methods, and what are the specifics of the GWAS it requires (sample size, individual-level data, power, type of trait)?

    I think the main advance is therefore currently conceptual, but not yet practical, unless more performance comparisons were done.

    It seems like the main audience would be geneticists, since I suspect most AI/ML researchers are familiar with this type of approach. If there are fundamental innovations in applying transformers in this specific way to genetics, that would be good to highlight in more depth.

    My expertise: statistical genetics and computer science, familiar with DNNs but not a practitioner in them.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



    Referee #2

    Evidence, reproducibility and clarity

    This manuscript describes the introduction of the Genotype-to-Phenotype Transformer (G2PT), described by the authors as "a framework for modeling hierarchical information flow among variants, genes, multigenic systems, and phenotypes." The authors used the ratio TG/HDL as a trait for proof of concept of this tool.

    Specific comments:

    1. The rationale for choosing the TG/HDL ratio for this proof-of-concept analysis is not well justified beyond it being a marker for insulin resistance. Overall, the use of a ratio may be problematic (see below). Analyses of TG and HDL separately as individual quantitative traits would be of interest. An analysis of a dichotomous clinical trait (T2DM or CAD) would also be of great interest.
    2. The approach to mapping SNPs to genes does not incorporate the most advanced approaches. This should be described in more detail.
    3. The example of gene prioritization at the A1/C3/A4/A5 gene locus is not particularly illuminating, as the prioritized genes are already well-known to influence TG and HDL-C levels and the TG/HDL ratio. Can the authors provide an example where G2PT prioritized a gene at a locus that is not already a well-known regulator of TG and HDL metabolism?
    4. The identification of epistatic interactions is a potentially interesting application of G2PT. However, Supplementary Table 1 shows a very limited number of such interactions involving even fewer genes, and most of these are well-established biological interactions (such as LPL/apoA5). The TGFB1 and FKBP1A interaction is interesting and should be discussed. What is needed to increase the number of potential interactions: greater power?
    5. Furthermore, the use of the TG/HDL ratio for the assessment of epistatic interactions may be problematic. For example, if one SNP affected only TG and the other only HDL-C, it would appear to be an epistatic interaction with regard to the ratio, although the biological epistasis may be limited to non-existent.
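
    The concern in point 5 can be illustrated with a small noise-free calculation. This is a minimal sketch under stated assumptions, not an analysis from the manuscript: the baseline levels and per-allele effects are hypothetical, one SNP shifts only TG and the other shifts only HDL, and "interaction" is measured as the double difference across the two genotypes. The raw ratio shows a non-zero interaction even though the SNPs never act on the same trait, whereas the log ratio is exactly additive in the two SNPs.

```python
import numpy as np

def interaction_contrast(table):
    """Double difference on the genotype (0,1) x (0,1) corner.

    The value is zero when the two SNP effects combine additively.
    """
    return table[1, 1] - table[1, 0] - table[0, 1] + table[0, 0]

g = np.arange(3)            # allele counts 0, 1, 2

# Hypothetical effect sizes, chosen only for illustration.
tg = 1.5 + 0.3 * g          # TG depends only on SNP A
hdl = 1.2 + 0.2 * g         # HDL depends only on SNP B

# Rows index the SNP A genotype, columns index the SNP B genotype.
ratio = tg[:, None] / hdl[None, :]
log_ratio = np.log(ratio)

raw_contrast = interaction_contrast(ratio)      # non-zero on the ratio scale
log_contrast = interaction_contrast(log_ratio)  # zero up to rounding error

print(f"ratio scale interaction:     {raw_contrast:.4f}")
print(f"log-ratio scale interaction: {log_contrast:.2e}")
```

    In this toy setting the apparent interaction is purely a consequence of the scale on which the trait is defined, which is why analysing TG and HDL separately, or the log-transformed ratio, would help separate scale effects from biological epistasis.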

    Significance

    This is a potentially interesting computational tool of interest to bioinformaticians, computational genomicists, and biologists.

    The proof of concept offered here using a single ratio is not sufficient to conclude its potential utility.

    My expertise is in genetics and molecular mechanisms of lipid metabolism.

  4. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



    Referee #1

    Evidence, reproducibility and clarity

    The authors introduce G2PT, a hierarchical graph transformer model that integrates genetic variants (SNPs), gene annotations, and multigenic systems (Gene Ontology) to predict and interpret complex traits.

    Major Comments:

    1. Insufficient Specification of Model Architecture: The description of the "hierarchical graph transformer" lacks technical depth. Key implementation details are missing: how node embeddings are initialized for SNPs, genes, and systems; how graph connectivity is defined at each level (e.g., adjacency matrices used in Equations 5-9, the sparsity); justification for the choice of embedding dimension and number of attention heads, including any sensitivity analysis; and the architecture of the feed-forward neural networks (e.g., number of layers, activation functions, and hidden dimensions).
    2. No Simulation Studies to Validate Epistasis Detection: Ground-truth epistatic interactions should be taken from those that have been manually validated in the literature. The central claim of discovering epistatic interactions relies heavily on the model's attention mechanism and downstream statistical filtering. However, no simulation studies are presented to validate that G2PT can reliably detect epistasis when ground-truth interactions are known. Demonstrating robust detection of non-additive interactions under varying genetic architectures and noise levels in simulated genotype-phenotype datasets is essential to substantiate the method's core capability.
    3. Lack of Justification for Model Complexity and Missing Ablation Insights: While Supplementary Figure 2 presents ablation studies, the manuscript needs to justify the high computational cost (168 GPU hours using 4×A30 GPUs) of the full model. It remains unclear how much performance gain is specifically due to reverse propagation (Equations 8-9), which is claimed to capture biological context. The benefit of using a full Gene Ontology hierarchy versus a flat system list is not quantified. There is also no comparison between bidirectional versus unidirectional propagation. Overall, the added complexity is not empirically shown to be necessary.
    4. Non-Equivalent Benchmarking Against PRS Methods: Figure 2 compares G2PT to polygenic risk score (PRS) methods such as LDpred2 and Lassosum, but G2PT is run only on SNPs pre-filtered by marginal association (p-values between 10⁻⁵ and 10⁻⁸), while the PRS methods use genome-wide SNPs. This introduces a strong bias in G2PT's favor by effectively removing noise. A fair comparison would require: (a) running LDpred2 and Lassosum on the same pre-filtered SNP sets as G2PT, or (b) running G2PT on genome-wide or LD-pruned SNP sets. The reported superior performance of G2PT may be driven primarily by this input filtering, not the model architecture.
    5. No Details on Hyperparameter Optimization: Although the manuscript mentions grid search for hyperparameter tuning, it provides no information about which parameters were optimized (e.g., learning rate, dropout rate, weight decay, attention dropout, FFNN dimensions), what search space was explored, or what final values were selected. There is also no assessment of how sensitive the model's performance is to these choices. Greater transparency would facilitate reproducibility.
    6. Absence of Control for Key Confounders: In interpreting attention scores as reflecting genetic relevance (e.g., the role of the immunoglobulin system), the model includes only age, sex, and genetic principal components as covariates. Important confounders such as BMI, alcohol use, or medication (e.g., statins) have not been controlled for. Since TG/HDL levels are strongly influenced by environment and lifestyle, it is entirely plausible that some high-attention features reflect environmental tagging, not biological causality.
    7. Oversimplified Treatment of SNP-to-Gene Mapping: The SNP-to-gene mapping strategy combines cS2G, eQTL, and nearest-gene annotations, but the limitations of this approach are not adequately addressed. The manuscript does not specify how conflicts between methods are resolved or what fraction of SNPs map ambiguously to multiple genes. Supplementary Figure 2 shows model performance degrades when using only nearest-gene mapping, but there is no systematic analysis of how mapping uncertainties propagate through the hierarchy and affect attention or interpretation.
    8. Naive Scoring of System Importance: The method used to quantify the biological relevance of systems (i.e., correlating attention scores with predicted phenotype values) risks circular reasoning. Since the model is trained to optimize prediction, systems that contribute strongly to prediction will naturally show high correlation, even if they are not biologically causal. No comparison is made with established gene set enrichment methods applied to GWAS summary statistics. The approach lacks an independent benchmark to validate that the "important" systems are biologically meaningful.
    9. No External Validation to Assess Generalizability: All evaluations are performed using cross-validation within the UK Biobank. There is no assessment of generalizability to independent cohorts or diverse ancestries. Given population structure, genotyping platform, and phenotype measurement variability, external validation is essential before claiming the method is suitable for broader use in polygenic risk assessment.
    10. Computational Burden and Scalability Are Not Addressed: The paper notes that training the model requires 168 GPU hours on 4×A30 GPUs for just ~5,000 SNPs. However, there is no discussion of whether G2PT can scale to larger SNP sets (e.g., genome-wide imputed data) or more complex biological hierarchies (e.g., Reactome pathways). Without addressing scalability, the model's applicability to real-world, large-scale genomic datasets remains unclear.
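
    The kind of simulation requested in comment 2 could be set up along the following lines. This is a minimal sketch under stated assumptions and does not involve G2PT itself: genotypes are drawn independently with a hypothetical allele frequency of 0.3, one SNP pair is given a planted interaction with hypothetical effect sizes, and recovery is checked with an ordinary least-squares test of the interaction term. In a full study the same simulated data would be passed to the model under evaluation, and the effect sizes, noise level, and sample size would be varied.

```python
import numpy as np

def interaction_t_stat(y, g_a, g_b):
    """t-statistic of the g_a * g_b coefficient in an OLS fit.

    Model: y ~ intercept + g_a + g_b + g_a * g_b
    """
    X = np.column_stack([np.ones_like(y), g_a, g_b, g_a * g_b])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof              # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
    return beta[3] / np.sqrt(cov[3, 3])

rng = np.random.default_rng(0)
n = 20_000                                    # simulated individuals

# Four independent SNPs coded as allele counts (hypothetical frequency 0.3).
geno = rng.binomial(2, 0.3, size=(n, 4)).astype(float)

# Ground truth: SNPs 0 and 1 have additive effects plus a planted
# interaction; SNPs 2 and 3 have no effect on the phenotype.
y = (0.2 * geno[:, 0] + 0.2 * geno[:, 1]
     + 0.3 * geno[:, 0] * geno[:, 1]
     + rng.normal(size=n))

t_planted = interaction_t_stat(y, geno[:, 0], geno[:, 1])
t_null = interaction_t_stat(y, geno[:, 2], geno[:, 3])

print(f"planted pair t = {t_planted:.1f}, null pair t = {t_null:.1f}")
```

    Repeating this over a grid of interaction effect sizes and noise variances, and recording how often the planted pair ranks among the top candidates, would give the power and false-positive estimates that the comment asks for.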

    Minor:

    1. Attention Weights as Mechanistic Insight: The paper equates high attention scores with biological importance, for example in highlighting the immunoglobulin system. There is no causal validation showing that altering the highlighted SNPs, genes, or systems has an actual effect on TG/HDL. Attention weights in transformer models are known to sometimes reflect spurious correlations, especially in high-dimensional settings. The correlation between attention scores and predictions (Supplementary Fig. 3a,b) does not constitute biological evidence. The interpretability claims can be restated without supporting functional or causal validation.

    Significance

    Novelty

    This work presents novelty by introducing the first transformer-based model that integrates the GO hierarchy to enable bidirectional mapping between genotype and phenotype. Additionally, the use of attention mechanisms to screen for epistasis offers a novel and computationally efficient alternative to traditional exhaustive SNP-SNP interaction tests.

    Impact

    Target Audience

    • Specialized: Computational biologists working on interpretable machine learning methods in genomics.
    • Broader: Geneticists investigating polygenic traits and drug developers focusing on pathway-level therapeutic targets.

    Limitations vs. Contributions

    While the work presents a clear conceptual advance by incorporating hierarchical biological priors and attention mechanisms, the technical contribution is somewhat limited by its validation on a single trait and the absence of simulation-based benchmarking. Nevertheless, the framework shows potential if extended to other traits and experimentally validated.

    Overall Assessment

    Recommendation: Major Revision

    Strengths:

    • Predictive performance appears strong.
    • The use of biological priors enables interpretability at the pathway level.

    Major Weaknesses:

    • The current validation is limited to a single trait, restricting generalizability.
    • The manuscript lacks a complete and clear description of the model architecture.
    • No simulations are provided to assess the method's ability to recover known epistatic interactions or pathways.

    Reviewer Expertise: Machine learning applications in genomics and genetics.