Novel discoveries and enhanced genomic prediction from modelling genetic risk of cancer age-at-onset

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This manuscript is a useful contribution to the field of complex trait genomics. The study does have some real strengths, such as focusing on cancer age-of-onset, developing methods for this unusual trait and using two cohorts. However, the significance of findings is difficult to evaluate without further comparisons and validations, leaving the work in its current form incomplete.

Abstract

Genome-wide association studies seek to attribute disease risk to DNA regions and facilitate subject-specific prediction and patient stratification. For later-life diseases, inference from case-control studies is hampered by the uncertainty that control group subjects might later be diagnosed. Time-to-event analysis treats controls as right-censored, making no additional assumptions about future disease occurrence and represents a more sound conceptual alternative for more accurate inference. Here, using data on 11 common cancers from the UK and Estonian Biobank studies, we provide empirical evidence that discovery and genomic prediction are greatly improved by analysing age-at-diagnosis, compared to a case-control model of association. We replicate previous findings from large-scale case-control studies and find an additional 7 previously unreported independent genomic regions, out of which 3 replicated in independent data. Our novel discoveries provide new insights into underlying cancer pathways, and our model yields a better understanding of the polygenicity and genetic architecture of the 11 tumours. We find that heritable germline genetic variation plays a vital role in cancer occurrence, with risk attributable to many thousands of underlying genomic regions. Finally, we show that Bayesian modelling strategies utilising time-to-event data increase prediction accuracy by an average of 20% compared to a recent summary statistic approach (LDpred-funct). As sample sizes increase, incorporating time-to-event data should be commonplace, improving case-control studies by using richer information about the disease process.

Article activity feed

  2. Reviewer #1 (Public Review):

    Summary:
    In this paper, the authors present genome-wide association analyses of 11 different cancers, including time-to-event analyses. They use two recently published Bayesian methods, one of which is constructed to handle time-to-event data. They demonstrate that polygenic risk scores trained on these models give nominally better predictions than standard polygenic risk scores. Further, by performing 11 GWASs in UKB while adjusting for the polygenic effects estimated by their improved predictor, they find seven novel loci implicated by one or both of these methods, three of which replicate in the Estonian Biobank.

    Strengths:
    A clear strength is that the authors evaluate the performance of the model in a completely different dataset (Estonian Biobank) than the one it is trained in.

    Weaknesses:
    The 11 phenotypes that the authors chose have the challenge that they are rare, particularly in healthy biobank participants, which means that (i) the benefit of modelling them as time-to-event traits is expected to be smaller and (ii) the models have to be stable under imbalanced case/control fractions. In the GWAS analyses, the authors handle this second problem by using a recently published association test that is robust to imbalanced data. This likely means that they avoid inflated test statistics, but also that they do not leverage the actual time-to-event information to its full potential.
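
    To make the censoring point concrete, the following is a minimal sketch of a right-censored log-likelihood, assuming a generic Weibull model. It is not the BayesW parametrisation, and the function name, toy ages, and parameter values are all hypothetical. Diagnosed individuals contribute a density term, while undiagnosed individuals contribute only the probability of being disease-free at their last observed age.

```python
import math

def weibull_loglik(times, events, shape, scale):
    """Log-likelihood of right-censored data under a Weibull model.

    times  : age at diagnosis (event = 1) or age at last follow-up (event = 0)
    events : 1 if diagnosed, 0 if still disease-free (right-censored)
    """
    total = 0.0
    for t, d in zip(times, events):
        log_surv = -((t / scale) ** shape)  # log S(t): disease-free up to age t
        if d:
            # Diagnosed at age t: log f(t) = log h(t) + log S(t)
            log_hazard = math.log(shape / scale) + (shape - 1) * math.log(t / scale)
            total += log_hazard + log_surv
        else:
            # Censored: all we know is that no diagnosis occurred before age t
            total += log_surv
    return total

# Hypothetical toy cohort: one case diagnosed at 62, three controls last seen disease-free.
ages = [62.0, 70.0, 55.0, 48.0]
diagnosed = [1, 0, 0, 0]
ll = weibull_loglik(ages, diagnosed, shape=4.0, scale=120.0)
```

    When the disease is rare, almost every term in this sum is a censored term, which is one way to see why the gain over a case-control model may be modest for these phenotypes.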

    The authors chose not to use the recently published methods BayesRR-RC and BayesW on their own. Instead, they run these models and then add an extra step: a logistic regression with an offset term set to the LOCO genomic values estimated by GMRM-BayesW and GMRM-BayesRR-RC, respectively. They write that this was because of the imbalanced case/control proportion, but not how the problem was detected. If the authors have insight about when the standard GMRM-BayesW and GMRM-BayesRR-RC become unreliable, I think it would be helpful to share it in this paper. Further, if the associations implicated by standard GMRM-BayesW and GMRM-BayesRR-RC are not reliable, I think we need some justification that the variance components reported in Figure 1 are still reliable.
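
    For readers unfamiliar with the extra step described above, the following is a minimal sketch of a single-variant logistic regression with a fixed offset, fitted by Newton-Raphson. It is a generic illustration, not the authors' implementation: the function name, the simulated genotypes, and the simulated offsets (standing in for LOCO genomic values) are all hypothetical.

```python
import math
import random

def fit_logistic_with_offset(snp, y, offset, n_iter=25):
    """Newton-Raphson fit of logit P(y = 1) = offset + b0 + b1 * snp.

    The offset enters the linear predictor with its coefficient fixed at 1,
    so only the intercept b0 and the SNP effect b1 are estimated.
    Returns (b0, b1, standard error of b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, yi, o in zip(snp, y, offset):
            p = 1.0 / (1.0 + math.exp(-(o + b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += yi - p          # gradient w.r.t. b0
            g1 += (yi - p) * x    # gradient w.r.t. b1
            h00 += w              # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Variance of b1 is the (1, 1) entry of the inverse information matrix
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1

# Hypothetical simulated data: genotypes in {0, 1, 2}, offsets standing in for LOCO values.
rng = random.Random(1)
n = 5000
snp = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n)]
offset = [rng.gauss(0.0, 0.5) for _ in range(n)]
y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(o - 2.0 + 0.4 * x))) else 0
     for x, o in zip(snp, offset)]
b0_hat, b1_hat, se_b1 = fit_logistic_with_offset(snp, y, offset)
```

    Because the offset is held fixed, the test statistic for the variant (here `b1_hat / se_b1`) comes entirely from a binary-outcome model, which is why the time-to-event information enters only through the offset.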

    The authors chose to compare the two new GWAS methods, GMRM-BayesW-adjusted and GMRM-BayesRR-RC-adjusted, to REGENIE, so an obvious first question in my opinion is if GMRM-BayesW-adjusted and GMRM-BayesRR-RC-adjusted find more signal than REGENIE.
    a. We see that 7 loci were found by GMRM-BayesW but not by REGENIE, but how many were found by REGENIE but not by GMRM-BayesW?
    b. Figure S5 as I understand it is showing that the mean -log(p-value) is lower in GMRM-BayesW than REGENIE for variants that have a p-value in GMRM-BayesW that is lower than 5e-8. I don't think this is a valid way to check if GMRM-BayesW has more power. I have a feeling that there could be a winner's curse-like phenomenon here. I think a more principled comparison could be provided.
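
    The concern in point b can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of the authors' data: two hypothetical methods, A and B, have identical power by construction, their test statistics are correlated (the value of `rho` is arbitrary), and variants are selected on method A's p-value only.

```python
import math
import random

def neg_log10_p(z):
    """Two-sided -log10 p-value for a standard-normal test statistic."""
    return -math.log10(math.erfc(abs(z) / math.sqrt(2.0)))

def simulate(n_variants=20000, effect=4.0, rho=0.9, p_threshold=5e-8, seed=0):
    """Mean -log10(p) of methods A and B over variants significant in A."""
    rng = random.Random(seed)
    cutoff = -math.log10(p_threshold)
    sel_a, sel_b = [], []
    for _ in range(n_variants):
        # Same true effect for both methods; noise terms have correlation rho
        e_a = rng.gauss(0.0, 1.0)
        e_b = rho * e_a + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        a = neg_log10_p(effect + e_a)
        b = neg_log10_p(effect + e_b)
        if a > cutoff:  # selection uses method A only
            sel_a.append(a)
            sel_b.append(b)
    return sum(sel_a) / len(sel_a), sum(sel_b) / len(sel_b), len(sel_a)

mean_a, mean_b, n_selected = simulate()
```

    In this setup, method A looks stronger on the selected variants even though neither method has more power, because selection conditions on A's noise being favourable. A symmetric comparison (selecting on either method, or on an external list of known loci) would avoid this.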

    The title of the paper ("Novel discoveries and enhanced genomic prediction from modelling genetic risk of cancer age-at-onset") seems to imply that the age of onset informed model (GMRM-BayesW) does better. But I think the foundation for that statement could be strengthened.
    Figure S6 shows that 261 previously reported loci were replicated by GMRM-BayesW-adjusted whereas 256 were replicated by GMRM-BayesRR-RC. How were previously reported loci defined? Did they include UKB data, and how many were there in total?
    In the PRS analyses presented in Figure 3a GMRM-BayesW does better than GMRM-BayesRR-RC in 8/11 phenotypes, which does not itself appear significant to me. And with overlapping confidence intervals the significance of the improvement is hard to see.
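
    The 8/11 count can be checked with an exact sign test. The sketch below assumes the 11 phenotypes are independent comparisons and that either method is equally likely to win under the null; the function name is hypothetical.

```python
from math import comb

def sign_test_p(wins, n):
    """Exact two-sided binomial p-value for `wins` out of `n` under p = 0.5."""
    k = max(wins, n - wins)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

p = sign_test_p(8, 11)  # 2 * 232 / 2048, approximately 0.227
```

    A two-sided p-value of about 0.23 is consistent with the impression that 8 wins out of 11 is not, on its own, evidence of a systematic improvement.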

    Table 1 says that rs35763415, rs117972357 and rs7902587 replicated in the Estonian Biobank, but Figure 3b says that rs35763415, rs117972357 and rs1015362 replicated in the Estonian Biobank. What is the difference between these two analyses? The methods say that you checked your findings for replication in FinnGen, but I don't see any results from FinnGen anywhere.

  3. Reviewer #2 (Public Review):

    Summary: Maksimova, Ojavee, and colleagues extend two of their methods, BayesW and BayesRR-RC, to be used as mixed-model association methods by combining them with an approach similar to step 2 of REGENIE. BayesW handles time-to-event data, whereas BayesRR-RC works for case-control phenotypes. They provide UKBB results for 11 cancers, replicate findings, and assess predictions in the Estonian Biobank.

    Strengths: Age-of-onset is becoming more and more available, and developing methods that make the best use of this additional information is valuable.

    Weaknesses: In this work, there is (for now) limited validation of results and comparison with other existing methods.