Risk factors affecting polygenic score performance across diverse cohorts
Curation statements for this article:
Curated by eLife
eLife assessment
This study presents a convincing analysis of the effects of covariates, such as age, sex, socioeconomic status, or biomarker levels, on the predictive accuracy of polygenic scores for body mass index. The work is further supported by important approaches for improving prediction accuracy by accounting for such covariates across a variety of association studies. The authors did a commendable job addressing reviewer suggestions and comments. The work will be of interest to colleagues using and developing methods for phenotypic prediction based on polygenic scores.
Abstract
Apart from ancestry, personal or environmental covariates may contribute to differences in polygenic score (PGS) performance. We analyzed effects of covariate stratification and interaction on body mass index (BMI) PGS (PGSBMI) across four cohorts of European (N=491,111) and African (N=21,612) ancestry. Stratifying on binary covariates and quintiles for continuous covariates, 18/62 covariates had significant and replicable R2 differences among strata. Covariates with the largest differences included age, sex, blood lipids, physical activity, and alcohol consumption, with R2 being nearly double between best and worst performing quintiles for certain covariates. 28 covariates had significant PGSBMI-covariate interaction effects, modifying PGSBMI effects by nearly 20% per standard deviation change. We observed overlap between covariates that had significant R2 differences among strata and interaction effects: across all covariates, their main effects on BMI were correlated with their maximum R2 differences and interaction effects (0.56 and 0.58, respectively), suggesting high-PGSBMI individuals have highest R2 and increase in PGS effect. Using quantile regression, we show the effect of PGSBMI increases as BMI itself increases, and that these differences in effects are directly related to differences in R2 when stratifying by different covariates. Given significant and replicable evidence for context-specific PGSBMI performance and effects, we investigated ways to increase model performance taking into account nonlinear effects. Machine learning models (neural networks) increased relative model R2 (mean 23%) across datasets. Finally, creating PGSBMI directly from GxAge GWAS effects increased relative R2 by 7.8%.
These results demonstrate that certain covariates, especially those most associated with BMI, significantly affect both PGS BMI performance and effects across diverse cohorts and ancestries, and we provide avenues to improve model performance that consider these effects.
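The stratified analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are simulated, and the function names (`r_squared`, `r2_by_quintile`) and all effect sizes are assumptions chosen only to show how a PGS-by-covariate interaction produces different R2 values across covariate quintiles.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation, i.e. the R2 of a one-predictor linear fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def r2_by_quintile(pgs, covariate, phenotype):
    """Sort individuals by the covariate, split into five bins, return PGS R2 per bin."""
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    size = len(order) // 5
    results = []
    for q in range(5):
        # The last bin absorbs any remainder when n is not divisible by 5.
        idx = order[q * size:(q + 1) * size] if q < 4 else order[4 * size:]
        results.append(r_squared([pgs[i] for i in idx], [phenotype[i] for i in idx]))
    return results

rng = random.Random(0)
n = 5000
pgs = [rng.gauss(0, 1) for _ in range(n)]
cov = [rng.gauss(0, 1) for _ in range(n)]
# Hypothetical generating model: main effects plus a PGS x covariate interaction.
bmi = [0.3 * g + 0.5 * c + 0.15 * g * c + rng.gauss(0, 1) for g, c in zip(pgs, cov)]

quintile_r2 = r2_by_quintile(pgs, cov, bmi)
for q, r2 in enumerate(quintile_r2, start=1):
    print(f"quintile {q}: R2 = {r2:.3f}")
```

Because the simulated PGS effect grows with the covariate, the top quintile shows a much higher R2 than the bottom one, which is the pattern the stratified analysis is designed to detect.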
Article activity feed



Author response:
The following is the authors’ response to the original reviews.
We previously responded to reviewer comments in a previous iteration of this draft, edited the manuscript accordingly, and have no further comments on the majority of them. However, we performed additional analyses mainly in response to weaknesses Reviewer 1 highlighted related to “one shortcoming [being] the lack of a conceptual model explaining the results”, and the eLife assessment stating “the study falls short of providing a cogent interpretation of key findings, which could be of great interest and utility”. We provide a conceptual explanation that ties together many of our results, which we demonstrate using real data and further explore using simulated data – these analyses are in a new section titled “Increase in PGS effect for increasing percentiles of BMI itself, and its relation to R2 differences when stratifying by covariates”, with the Discussion also being updated accordingly.
Essentially, we demonstrate that the effect of PGSBMI increases as BMI itself increases (using quantile regression; newly created Figure 5). This finding helps explain the correlation between covariate main effects, interaction effects, and maximum R2 differences when stratifying on different covariates, and also why no single covariate or combination of covariates seemed to be of unusual interest. While this result readily explains why covariates with larger main effects have larger interaction effects, by itself it does not seem to explain the differences in R2 in covariate-stratified bins, but we show using portions of real data and simulated data that in the case of this study they are closely related.
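The idea behind the quantile regression result can be sketched on toy data. Everything here is an assumption made for illustration: the data are deterministic rather than real, the helper names (`check_loss`, `quantile_line`) are hypothetical, and the brute-force fit (trying every line through two data points, one of which is always optimal for a single predictor) only suits small samples. In practice a library routine such as statsmodels' `QuantReg` would likely be used instead.

```python
from itertools import combinations

def check_loss(residual, tau):
    """Pinball (check) loss minimized by quantile regression at quantile tau."""
    return residual * tau if residual >= 0 else residual * (tau - 1)

def quantile_line(x, y, tau):
    """Brute-force one-predictor quantile regression; returns (intercept, slope)."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # two points with the same x do not define a regression line
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(check_loss(b - (intercept + slope * a), tau) for a, b in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

# Toy data: the outcome is the PGS value times a multiplier, so the spread of
# the outcome grows with the PGS (a multiplicative, heteroskedastic relationship).
pgs, bmi = [], []
for x in range(1, 11):        # ten hypothetical PGS values
    for k in range(1, 10):    # nine outcome multipliers per PGS value
        pgs.append(x)
        bmi.append(x * k)

slopes = {tau: quantile_line(pgs, bmi, tau)[1] for tau in (0.1, 0.5, 0.9)}
print(slopes)
```

The fitted slope rises from the 10th to the 90th percentile of the outcome, which is the signature of a PGS effect that increases as the phenotype itself increases.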
Effectively, as the effect of PGSBMI increases, variance in the phenotype will also increase. So long as the residuals do not increase proportionately, this causes R2 to also increase, as R2 directly depends on outcome variance. We demonstrate this using simulated data (newly created S Figure 2) and real data (newly created S Figure 3). So the largest R2 differences between certain covariate-stratified bins seem to be a direct consequence of those covariates also having the largest PGSBMI*covariate interaction effects. These results tie into our previous response to Reviewer 1: not only is there heteroskedasticity in the relationship between PGSBMI and BMI, but a cause of the heteroskedasticity is an increasing effect of PGSBMI as BMI itself increases.
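The argument can be written out as a short worked equation. The notation is an assumption introduced here, not taken from the manuscript: within a stratum $s$, $g$ is the PGS with variance $\sigma_g^2$, $\beta_s$ is its effect, and $e$ is a residual with variance $\sigma_e^2$ that is uncorrelated with $g$.

```latex
\begin{align}
y &= \beta_s\, g + e, \qquad \operatorname{Var}(g) = \sigma_g^2, \quad \operatorname{Var}(e) = \sigma_e^2, \\
\operatorname{Var}(y) &= \beta_s^2 \sigma_g^2 + \sigma_e^2, \\
R_s^2 &= \frac{\beta_s^2 \sigma_g^2}{\beta_s^2 \sigma_g^2 + \sigma_e^2}.
\end{align}
```

With $\sigma_e^2$ held fixed, $R_s^2$ increases with $|\beta_s|$. If the residual variance instead scaled proportionately, $\sigma_e^2 = k\,\beta_s^2$, then $R_s^2 = \sigma_g^2 / (\sigma_g^2 + k)$ would be the same in every stratum, which is why the condition on the residuals matters.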
In the Discussion, we highlight several broad implications of these findings. First, these results may, in part, provide a generalizable explanation for epistasis, as the effect of a PGS (or any individual SNP) seems to depend on phenotype, and as phenotype depends on many SNPs, the effect of PGS and individual SNPs depends on other SNPs. Second, these results may also provide a generalizable explanation for GxE, as demonstrated in this paper: interaction effects for SNPs (or a PGS) may largely depend on the phenotypic value itself, rather than any specific environment(s) or combination thereof. Finally, related to our previous response to Reviewer 2, modeling effects of SNPs dependent on phenotype itself would almost certainly result in gains in PGS performance (and locus discovery), which should also be larger than, e.g., just the GxAge effects we demonstrated in this manuscript.


Joint Public Review:
In this paper, Hui and colleagues investigate how the predictive accuracy of a polygenic score (PGS) for body mass index (BMI) changes when individuals are stratified by 62 different covariates. After showing that the PGS has different predictive power across strata for 18 out of 62 covariates, they turn to understanding why these differences arise and seeing if predictive performance could be improved. First, they investigated which types of covariates result in the largest differences in PGS predictive power, finding that covariates with larger "main effects" on the trait and covariates with larger interaction effects (interacting with the PGS to affect the trait) tend to better stratify individuals by PGS performance. The authors then see if including interactions between the PGS and covariates improves predictive accuracy, finding that linear models only result in modest increases in performance but nonlinear models result in more substantial performance gains.
Overall, the results are interesting and well-supported. The results will be broadly interesting to people using and developing PGS methods, as well as the broader statistical genetics community.
A few of the important points of the paper are:
A major impediment to the clinical use of PGS is the interaction between the PGS and various other routinely measured covariates, and this work provides a very interesting empirical study along these lines. The problem is interesting, and the work presented here is a convincing empirical study of the problem.
The result that PGS accuracy differs across covariates, but in a way that is not well-captured by linear models with interactions, is important for PGS method development.
The quantile regression analysis is an interesting approach to explore how and why PGS may differ in accuracy across different strata of individuals.



Author Response:
Reviewer #1 (Public Review):
In this paper, Hui and colleagues investigate how the predictive accuracy of a polygenic score (PGS) for body mass index (BMI) changes when individuals are stratified by 62 different covariates. After showing that the PGS has different predictive power across strata for 18 out of 62 covariates, they turn to understanding why these differences arise and seeing if predictive performance could be improved. First, they investigated which types of covariates result in the largest differences in PGS predictive power, finding that covariates with larger "main effects" on the trait and covariates with larger interaction effects (interacting with the PGS to affect the trait) tend to better stratify individuals by PGS performance. The authors then see if including interactions between the PGS and covariates improves predictive accuracy, finding that linear models only result in modest increases in performance but nonlinear models result in more substantial performance gains.
Overall, the results are interesting and well-supported. The results will be broadly interesting to people using and developing PGS methods. Below I list some strengths and minor weaknesses.
Strengths:
A major impediment to the clinical use of PGS is the interaction between the PGS and various other routinely measured covariates, and this work provides a very interesting empirical study along these lines. The problem is interesting, and the work presented here is a convincing empirical study of the problem.
The result that PGS accuracy differs across covariates, but in a way that is not well-captured by linear models with interactions, is important for PGS method development.
Thank you for all of the positive comments.
Weakness:
While arguably outside the scope of this paper, one shortcoming is the lack of a conceptual model explaining the results. It is interesting and empirically useful that PGS prediction accuracy differs across many covariates, but some of the results are hard to reconcile simultaneously. For example, it is interesting that triglyceride levels are associated with PGS performance across cohorts, but it seems like the effect on performance is discordant across datasets (Figure 2). Similarly, many of these effects have discordant (linear) interactions across cohorts (Figure 3). Overall it is surprising that the same covariates would be important but for presumably different reasons in different cohorts. Similarly, it would be good to discuss how the present results relate to the conceptual models in Mostafavi et al. (eLife 2020) and Zhu et al. (Cell Genomics 2023).
Thank you for the comments. We agree that more generalizable explanations would be useful, which may be worth exploring in future work. Specifically, if there is heteroskedasticity in the relationship between PGS and BMI (e.g., phenotypic variance increases for higher values of BMI while PGS variance does not, or at least by a different amount), then that may partially explain the performance differences when stratifying by covariates that have main effects on BMI, somewhat similarly to what is presented in Figure 2 of Mostafavi et al. Such results may imply that similar performance differences could occur when stratifying by the phenotype itself, although this still may not explain differences in PGS effects, and differences in performance when using nonlinear methods (such as in this work and in Figure 4 of Zhu et al.). While we observe discordant effects for certain covariates across datasets, the findings from the correlation analyses use all cohorts and ancestries, and we expect that these differences in effects across datasets may be due to differences in the covariates' relationship with BMI across datasets (triglyceride levels may be especially noisy due to their sensitivity to fasting, which may have been controlled for differently across datasets).
Reviewer #2 (Public Review):
This work follows in the footsteps of earlier work showing that BMI prediction accuracy can vary dramatically by context, even within a relatively ancestrally homogenous sample. This is an important observation that is worth the extension to different context variables and samples.
Many of the follow-up analyses commendably try to take us a step further towards explaining the underlying observed trends of variable prediction accuracy for BMI. Some of these analyses, however, are somewhat confounded and hard to interpret.
For example, many of the covariates by which the authors stratify the sample may drive range restriction effects. Further, the covariates considered could be causally affected by genotype and causally affect BMI, with reverse causality effects; other covariates may be partially causally affected by both genotype and BMI, resulting in collider bias. Finally, population structure differences between quintiles of a covariate may drive variable levels of stratification. These can bias estimation and confound interpretation, and at least one of them intuitively seems like a concern for each of the context variables (e.g., the covariates SES, LDL, diet, age, smoking, and alcohol drinking).
The increased prediction accuracy observed with some of the age-dependent prediction models is notable. Despite the clear utility of this investigation, I am not aware of much existing work that shows such improvements for context-aware prediction models (compared to additive/main effect models). I would be curious to see if the predictive utility extends to held-out data from a data set distinct from the UKB, where the model was trained, or whether it replicates when predicting variation within families. Such analyses could strengthen the evidence for these models capturing direct causal effects, rather than other reasons for the associations existing in the UKB sample.
Thank you for the comments. We agree there are certain biases that may be introduced in these analyses. For population structure between quintiles, the analyses are already stratified by ancestry and include the top 5 genetic principal components, which may help with this issue. In the interaction models we also included separate terms for the PGS of the covariate, which was meant to better capture the environmental component of the covariates and may partially ameliorate issues of collider bias, as SNPs that causally affect both BMI and the covariate would be partially adjusted for. While range restriction effects could introduce bias, in the correlation analyses the main effects and interaction effects (which were estimated without range restriction) have strong and reproducible correlations with PGS R2 differences across datasets.
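An interaction model of the kind described in this response can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data are simulated, the coefficients are invented, and the helpers (`solve`, `ols`) are a hypothetical stdlib-only least-squares fit. The design has a PGSBMI term, the covariate, their product, and a separate PGS for the covariate; genetic principal components and other adjustments are omitted.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

rng = random.Random(1)
rows, bmi = [], []
for _ in range(4000):
    pgs_bmi = rng.gauss(0, 1)
    pgs_cov = rng.gauss(0, 1)               # hypothetical PGS for the covariate
    cov = 0.4 * pgs_cov + rng.gauss(0, 1)   # covariate = genetic part + environment
    # Design row: intercept, PGS, covariate, PGS x covariate, PGS of the covariate.
    rows.append([1.0, pgs_bmi, cov, pgs_bmi * cov, pgs_cov])
    bmi.append(0.3 * pgs_bmi + 0.5 * cov + 0.06 * pgs_bmi * cov
               + 0.1 * pgs_cov + rng.gauss(0, 1))

intercept, b_pgs, b_cov, b_inter, b_pgs_cov = ols(rows, bmi)
# Relative change in the PGS effect per unit increase in the covariate.
relative_change = b_inter / b_pgs
print(f"PGS effect {b_pgs:.3f}, interaction {b_inter:.3f}, ratio {relative_change:.2f}")
```

The ratio of the interaction coefficient to the PGS main effect expresses how much the PGS effect is modified per unit of the covariate, the same kind of quantity the abstract reports per standard deviation.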
We agree the increased prediction performance using PGS created directly from GxAge GWAS effects is notable, as it is essentially a “free” performance increase that doesn’t require any new data, and it is likely generalizable to additional covariates. It would be useful to validate its performance in other datasets, especially ones that are outside of the 40-69 age range of UKBB.
Reviewer #3 (Public Review):
Polygenic scores (PGS), constructed based on genetic effect sizes estimated in genome-wide association studies (GWAS) and used to predict phenotypes in humans, have attracted considerable recent interest in human and evolutionary genetics, and in the social sciences. Recent work, however, has shown that PGSs have limited portability across ancestry groups, and that even within an ancestry group, their predictive accuracy varies markedly depending on characteristics such as the socioeconomic status, age, and sex of the individuals in the samples used to construct them and to which they are applied. This study takes further steps in investigating and addressing the latter problem, focusing on body mass index, a phenotype of substantial biomedical interest. Specifically, it quantifies the effects of a large number of covariates and of interactions between these covariates and the PGS on prediction accuracy; it also examines the utility of including such covariates and interactions in the construction of predictors using both standard methods and artificial neural networks. This study would be of interest to investigators who develop and apply PGSs.
I should add that I have not worked on PGSs and am not a statistician, and apologize in advance if this has led to some misunderstandings.
Strengths:
 The paper presents a much more comprehensive assessment of the effects of covariates than previous studies. It finds many covariates to have a substantial effect, which further highlights the importance of this problem to the development and application of PGSs for BMI and more generally.
 The findings re the relationships between the effects of covariates and interactions between covariates and PGSs are, to the best of my knowledge, novel and interesting.
 The development of predictors that account for multiple covariates and their interaction with the PGS are, to the best of my knowledge, novel and may prove useful in future efforts to produce reliable PGSs.
 The improvement offered by the predictors that account for PGS and covariates using neural networks highlights the importance of nonlinear interactions that are not addressed by standard methods, which is both interesting and likely to be of future utility.
Thank you for the positive feedback.
Weaknesses:
 The paper would benefit substantially from extensive editing. It also uses terminology that is specific to recent literature on PGSs, thus limiting accessibility to a broader readership.
 The potential meaning of most of the results is not explored. Some examples are provided below:
• The paper emphasizes that 18/62 covariates examined show significant effects, but this result clearly depends on the covariates included. It would be helpful to provide more detail on how these covariates were chosen. Moreover, many of these covariates are likely to be correlated, making this result more difficult to interpret. Could these questions at least be partially addressed using the predictors constructed using all covariates and their interactions jointly (i.e., with LASSO)? In that regard, it would be helpful to know how many of the covariates and interactions were used in this predictor (I apologize if I missed that).
• While the relationship between covariate effects and covariate-PGS interaction effects is intriguing, it is difficult to interpret without articulating what one would expect, i.e., what would be an appropriate null.
• The finding that using artificial neural networks substantially improves prediction over more standard methods is especially intriguing, and highlights the potential importance of nonlinear relationships between PGSs and covariates. These relationships remain hidden in a black box, however. Even fairly straightforward analyses, based on using different combinations of the PGS and/or covariates, may shed some light on these relationships. For example, analyzing which covariates have a substantial effect on the prediction or varying one covariate at a time for different values of the PGS, etc.
 The relationship to previous work should be discussed in greater detail.
Thank you for the comments. Regarding running LASSO with all covariates along with each of their interactions with PGS in one model, upon reading those sections of the text again we agree it is a little unclear, but we actually did something very similar already (related sections have been edited for clarity in our revised manuscript), with these results being presented later on in the neural network section (second paragraph, S Table 7; those results specifically aren’t in Figure 5). We just looked at changes in prediction performance, and did not try to interpret the model coefficients. We agree that many of the covariates are probably correlated, but based on the correlation results (Figure 4) it doesn’t seem like any covariate is especially important separately from its effect on BMI itself, i.e., whatever covariates were chosen by LASSO may still not be especially important. This explanation is related to the interpretation of the neural network results, where neural networks improved performance even over linear models with just age and sex and their interactions with PGS as additional covariates, which may suggest that increased performance is coming from nonlinearities apart from multiplicative interaction effects with the PGS. So observing the coefficients from LASSO, which is still a linear model, may not substantially aid in explaining the relationships that increase prediction performance using neural networks (additionally, this analysis may be difficult to replicate since many of the covariates are not present in multiple datasets). But this replication would be nice to see in future studies if such datasets exist. In terms of the null relationship between covariate main and interaction effects, if they are from the same model they will inherently be correlated, but the main effects from Figure 4 are from a main effects model only. Regarding the other points, the text will be edited for clarity and elaboration on specific topics.

eLife assessment
This study presents a valuable analysis of the effects of covariates, such as age, sex, socioeconomic status, or biomarker levels, on the predictive accuracy of polygenic scores for body mass index; it also presents approaches for improving prediction accuracy by accounting for such covariates. While the analyses are solid, the study falls short of providing a cogent interpretation of key findings, which could be of great interest and utility. The work will be of interest to people using and developing methods for phenotypic prediction based on polygenic scores.

Reviewer #1 (Public Review):
In this paper, Hui and colleagues investigate how the predictive accuracy of a polygenic score (PGS) for body mass index (BMI) changes when individuals are stratified by 62 different covariates. After showing that the PGS has different predictive power across strata for 18 out of 62 covariates, they turn to understanding why these differences arise and seeing if predictive performance could be improved. First, they investigated which types of covariates result in the largest differences in PGS predictive power, finding that covariates with larger "main effects" on the trait and covariates with larger interaction effects (interacting with the PGS to affect the trait) tend to better stratify individuals by PGS performance. The authors then see if including interactions between the PGS and covariates improves predictive accuracy, finding that linear models only result in modest increases in performance but nonlinear models result in more substantial performance gains.
Overall, the results are interesting and well-supported. The results will be broadly interesting to people using and developing PGS methods. Below I list some strengths and minor weaknesses.
Strengths:
A major impediment to the clinical use of PGS is the interaction between the PGS and various other routinely measured covariates, and this work provides a very interesting empirical study along these lines. The problem is interesting, and the work presented here is a convincing empirical study of the problem.
The result that PGS accuracy differs across covariates, but in a way that is not well-captured by linear models with interactions, is important for PGS method development.
Weakness:
While arguably outside the scope of this paper, one shortcoming is the lack of a conceptual model explaining the results. It is interesting and empirically useful that PGS prediction accuracy differs across many covariates, but some of the results are hard to reconcile simultaneously. For example, it is interesting that triglyceride levels are associated with PGS performance across cohorts, but it seems like the effect on performance is discordant across datasets (Figure 2). Similarly, many of these effects have discordant (linear) interactions across cohorts (Figure 3). Overall it is surprising that the same covariates would be important but for presumably different reasons in different cohorts. Similarly, it would be good to discuss how the present results relate to the conceptual models in Mostafavi et al. (eLife 2020) and Zhu et al. (Cell Genomics 2023).

Reviewer #2 (Public Review):
This work follows in the footsteps of earlier work showing that BMI prediction accuracy can vary dramatically by context, even within a relatively ancestrally homogenous sample. This is an important observation that is worth the extension to different context variables and samples.
Many of the follow-up analyses commendably try to take us a step further towards explaining the underlying observed trends of variable prediction accuracy for BMI. Some of these analyses, however, are somewhat confounded and hard to interpret.
For example, many of the covariates by which the authors stratify the sample may drive range restriction effects. Further, the covariates considered could be causally affected by genotype and causally affect BMI, with reverse causality effects; other covariates may be partially causally affected by both genotype and BMI, resulting in collider bias. Finally, population structure differences between quintiles of a covariate may drive variable levels of stratification. These can bias estimation and confound interpretation, and at least one of them intuitively seems like a concern for each of the context variables (e.g., the covariates SES, LDL, diet, age, smoking, and alcohol drinking).
The increased prediction accuracy observed with some of the age-dependent prediction models is notable. Despite the clear utility of this investigation, I am not aware of much existing work that shows such improvements for context-aware prediction models (compared to additive/main effect models). I would be curious to see if the predictive utility extends to held-out data from a data set distinct from the UKB, where the model was trained, or whether it replicates when predicting variation within families. Such analyses could strengthen the evidence for these models capturing direct causal effects, rather than other reasons for the associations existing in the UKB sample.

Reviewer #3 (Public Review):
Polygenic scores (PGSs), constructed based on genetic effect sizes estimated in genome-wide association studies (GWAS) and used to predict phenotypes in humans, have attracted considerable recent interest in human and evolutionary genetics, and in the social sciences. Recent work, however, has shown that PGSs have limited portability across ancestry groups, and that even within an ancestry group, their predictive accuracy varies markedly depending on characteristics such as the socioeconomic status, age, and sex of the individuals in the samples used to construct them and to which they are applied. This study takes further steps in investigating and addressing the latter problem, focusing on body mass index, a phenotype of substantial biomedical interest. Specifically, it quantifies the effects of a large number of covariates and of interactions between these covariates and the PGS on prediction accuracy; it also examines the utility of including such covariates and interactions in the construction of predictors using both standard methods and artificial neural networks. This study would be of interest to investigators who develop and apply PGSs.
I should add that I have not worked on PGSs and am not a statistician, and apologize in advance if this has led to some misunderstandings.
Strengths:
 The paper presents a much more comprehensive assessment of the effects of covariates than previous studies. It finds many covariates to have a substantial effect, which further highlights the importance of this problem to the development and application of PGSs for BMI and more generally.
 The findings regarding the relationships between the effects of covariates and interactions between covariates and PGSs are, to the best of my knowledge, novel and interesting.
 The development of predictors that account for multiple covariates and their interactions with the PGS is, to the best of my knowledge, novel and may prove useful in future efforts to produce reliable PGSs.
 The improvement offered by the predictors that account for PGS and covariates using neural networks highlights the importance of nonlinear interactions that are not addressed by standard methods, which is both interesting and likely to be of future utility.
Weaknesses:
 The paper would benefit substantially from extensive editing. It also uses terminology that is specific to recent literature on PGSs, thus limiting accessibility to a broader readership.
 The potential meaning of most of the results is not explored. Some examples are provided below:
• the paper emphasizes that 18/62 covariates examined show significant effects, but this result clearly depends on the covariates included. It would be helpful to provide more detail on how these covariates were chosen. Moreover, many of these covariates are likely to be correlated, making this result more difficult to interpret. Could these questions at least be partially addressed using the predictors constructed using all covariates and their interactions jointly (i.e., with LASSO)? In that regard, it would be helpful to know how many of the covariates and interactions were used in this predictor (I apologize if I missed that).
• While the relationship between covariate effects and covariate-PGS interaction effects is intriguing, it is difficult to interpret without articulating what one would expect, i.e., what would be an appropriate null.
• The finding that using artificial neural networks substantially improves prediction over more standard methods is especially intriguing, and highlights the potential importance of nonlinear relationships between PGSs and covariates. These relationships remain hidden in a black box, however. Even fairly straightforward analyses, based on using different combinations of the PGS and/or covariates may shed some light on these relationships. For example, analyzing which covariates have a substantial effect on the prediction or varying one covariate at a time for different values of the PGS, etc.
 The relationship to previous work should be discussed in greater detail. 
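
The suggestion above to vary one covariate at a time for different values of the PGS can be sketched as follows. This is only an illustration: `black_box` is a hypothetical stand-in for a fitted predictor, not the authors' neural network, and its coefficients, the covariate names `age` and `activity`, and the three example rows are invented (covariates are assumed to be standardized). The probe sets one covariate to a fixed value for every individual, leaves the others at their observed values, and reports the average change in predicted BMI per unit of PGS.

```python
from statistics import mean


def black_box(pgs, age, activity):
    """Hypothetical fitted predictor; the PGS-by-age term is deliberate."""
    return 27.0 + 1.0 * pgs + 0.3 * age - 0.4 * activity + 0.2 * pgs * age


def pgs_effect_profile(model, rows, name, grid, pgs_low=-1.0, pgs_high=1.0):
    """Average predicted PGS slope with covariate `name` fixed at each grid value."""
    profile = {}
    for value in grid:
        diffs = []
        for row in rows:
            # Overwrite one covariate, keep the others as observed.
            probe = dict(row, **{name: value})
            diffs.append(model(pgs_high, **probe) - model(pgs_low, **probe))
        profile[value] = mean(diffs) / (pgs_high - pgs_low)
    return profile


# Invented covariate values for three individuals.
rows = [
    {"age": -1.0, "activity": 0.5},
    {"age": 0.0, "activity": -0.2},
    {"age": 1.0, "activity": 1.3},
]
grid = [-1.0, 0.0, 1.0]

age_profile = pgs_effect_profile(black_box, rows, "age", grid)
activity_profile = pgs_effect_profile(black_box, rows, "activity", grid)
print("PGS slope by age:", age_profile)
print("PGS slope by activity:", activity_profile)
```

A slope that changes across the grid (here for `age`) points to a covariate that modifies the PGS effect inside the model, while a flat profile (here for `activity`) points to a covariate that enters only additively.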