Novel discoveries and enhanced genomic prediction from modelling genetic risk of cancer age-at-onset

Ekaterina S. Maksimova
Sven E. Ojavee
Kristi Läll
Marie C. Sadler
Reedik Mägi
Zoltan Kutalik
Matthew R. Robinson

Curated by eLife

eLife assessment

This manuscript is a useful contribution to the field of complex trait genomics. The study does have some real strengths, such as focusing on cancer age-of-onset, developing methods for this unusual trait and using two cohorts. However, the significance of findings is difficult to evaluate without further comparisons and validations, leaving the work in its current form incomplete.

Abstract

Genome-wide association studies seek to attribute disease risk to DNA regions and facilitate subject-specific prediction and patient stratification. For later-life diseases, inference from case-control studies is hampered by the uncertainty that control group subjects might later be diagnosed. Time-to-event analysis treats controls as right-censored, making no additional assumptions about future disease occurrence, and represents a conceptually sounder alternative for more accurate inference. Here, using data on 11 common cancers from the UK and Estonian Biobank studies, we provide empirical evidence that discovery and genomic prediction are greatly improved by analysing age-at-diagnosis, compared to a case-control model of association. We replicate previous findings from large-scale case-control studies and find an additional 7 previously unreported independent genomic regions, of which 3 replicated in independent data. Our novel discoveries provide new insights into underlying cancer pathways, and our model yields a better understanding of the polygenicity and genetic architecture of the 11 tumours. We find that heritable germline genetic variation plays a vital role in cancer occurrence, with risk attributable to many thousands of underlying genomic regions. Finally, we show that Bayesian modelling strategies utilising time-to-event data increase prediction accuracy by an average of 20% compared to a recent summary statistic approach (LDpred-funct). As sample sizes increase, incorporating time-to-event data should be commonplace, improving case-control studies by using richer information about the disease process.

Version published to 10.7554/elife.89882.1 on eLife
Oct 9, 2023
Version published to 10.7554/elife.89882 on eLife
Oct 9, 2023
eLife
Oct 6, 2023

eLife
Oct 6, 2023

Reviewer #1 (Public Review):

Summary:
In this paper the authors present genome-wide association analyses of 11 different cancers, including time-to-event analyses. The authors use two recently published Bayesian methods, one of which is constructed to handle time-to-event data. They demonstrate that polygenic risk scores trained on these models give nominally better predictions than standard polygenic risk scores. Further, by performing 11 GWASs in UKB while adjusting for the polygenic effects estimated by their improved predictor, they find seven novel loci implicated by one or both of these methods, three of which replicate in the Estonian Biobank.

Strengths:
A clear strength is that the authors evaluate the performance of the model in a completely different dataset (Estonian Biobank) than the one it is trained in.

Weaknesses:
The 11 phenotypes that the authors chose are challenging because they are rare, particularly in healthy biobank participants, which means that (i) the benefit of modeling them in a time-to-event analysis is expected to be smaller and (ii) the models have to be stable under imbalanced case/control fractions. In the GWAS analyses the authors handle this second problem by using a recently published association test that is robust to imbalanced data, which likely means that they avoid inflated test statistics, but also that they do not leverage the actual time-to-event information to its full potential.

The authors chose not to use the recently published methods BayesRR-RC and BayesW directly. Instead, they run these models and then add an extra step in which they run a logistic regression with an offset term set to the LOCO genomic values estimated by GMRM-BayesW and GMRM-BayesRR-RC, respectively. They write that this was because of the imbalanced case/control proportion, but not how the problem was detected. If the authors have insight into when the standard GMRM-BayesW and GMRM-BayesRR-RC become unreliable, I think it would be helpful to share it in this paper. Further, if the associations implicated by standard GMRM-BayesW and GMRM-BayesRR-RC are not reliable, I think we need some justification that the variance components reported in Figure 1 are still reliable.
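
To make the adjustment step discussed in this paragraph concrete, here is a minimal sketch of a single-variant logistic regression with an offset. It assumes the offset holds each participant's LOCO genomic value on the logit scale. The function name, the Newton-Raphson fitting and the toy data are illustrative assumptions, not the authors' implementation; a real analysis would also include covariates and a test statistic for the variant effect.

```python
import math

def fit_logistic_with_offset(genotypes, outcomes, offsets, n_iter=50, tol=1e-10):
    """Fit logit(P(y=1)) = offset + b0 + b1 * genotype by Newton-Raphson.

    The offset enters the linear predictor with a fixed coefficient of 1,
    so only the intercept b0 and the variant effect b1 are estimated.
    (Hypothetical helper for illustration only.)
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for g, y, off in zip(genotypes, outcomes, offsets):
            p = 1.0 / (1.0 + math.exp(-(off + b0 + b1 * g)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * g
            h00 += w
            h01 += w * g
            h11 += w * g * g
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Toy data (made up): 8 participants, genotype coded 0/1, binary case status.
toy_genotypes = [0, 0, 0, 0, 1, 1, 1, 1]
toy_outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
toy_offsets = [0.0] * 8  # stand-in for LOCO genomic values
intercept, snp_effect = fit_logistic_with_offset(toy_genotypes, toy_outcomes, toy_offsets)
print(f"intercept = {intercept:.4f}, variant log-odds effect = {snp_effect:.4f}")
```

Because the offset has a fixed coefficient, the polygenic background is held constant while the variant effect is estimated, which is the general idea behind testing a variant conditional on a leave-one-chromosome-out predictor.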

The authors chose to compare the two new GWAS methods, GMRM-BayesW-adjusted and GMRM-BayesRR-RC-adjusted, to REGENIE, so an obvious first question in my opinion is if GMRM-BayesW-adjusted and GMRM-BayesRR-RC-adjusted find more signal than REGENIE.
a. We see that 7 loci were found by GMRM-BayesW but not by REGENIE, but how many were found by REGENIE but not by GMRM-BayesW?
b. Figure S5, as I understand it, shows that the mean -log(p-value) is lower in GMRM-BayesW than REGENIE for variants that have a p-value in GMRM-BayesW that is lower than 5e-8. I don't think this is a valid way to check if GMRM-BayesW has more power. I have a feeling that there could be a winner's curse-like phenomenon here. I think a more principled comparison could be provided.

The title of the paper ("Novel discoveries and enhanced genomic prediction from modelling genetic risk of cancer age-at-onset") seems to imply that the age-of-onset-informed model (GMRM-BayesW) does better. But I think the foundation for that statement could be strengthened.
Figure S6 shows that 261 previously reported loci were replicated by GMRM-BayesW-adjusted whereas 256 were replicated by GMRM-BayesRR-RC. How were previously reported loci defined? Did they include UKB data? And how many were there in total?
In the PRS analyses presented in Figure 3a GMRM-BayesW does better than GMRM-BayesRR-RC in 8/11 phenotypes, which does not in itself appear significant to me. And with overlapping confidence intervals, the significance of the improvement is hard to see.
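
One rough way to quantify the "8/11 phenotypes" remark is an exact sign test. The sketch below assumes the 11 phenotypes can be treated as independent trials with a 50% chance of either method winning under the null; that independence is an assumption made for illustration, not something the manuscript or the review states, and the function name is hypothetical.

```python
from math import comb

def sign_test_p_value(wins, trials):
    """One-sided exact binomial P(X >= wins) under a fair coin (p = 0.5)."""
    tail = sum(comb(trials, k) for k in range(wins, trials + 1))
    return tail / 2 ** trials

one_sided = sign_test_p_value(8, 11)
two_sided = min(1.0, 2 * one_sided)
print(f"one-sided p = {one_sided:.4f}, two-sided p = {two_sided:.4f}")
# prints: one-sided p = 0.1133, two-sided p = 0.2266
```

Under these assumptions, winning in 8 of 11 phenotypes is compatible with chance at conventional thresholds, which is consistent with the concern raised here.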

In Table 1 it says that rs35763415, rs117972357 and rs7902587 replicated in the Estonian Biobank, but in Figure 3b it says that rs35763415, rs117972357 and rs1015362 replicated in the Estonian Biobank. What is the difference between these two analyses? In the methods it says that you checked your findings for replication in FinnGen, but I don't see any results from FinnGen anywhere.

eLife
Oct 6, 2023

Reviewer #2 (Public Review):

Summary: Maksimova, Ojavee, and colleagues extend two of their methods, BayesW and BayesRR-RC, for use as mixed-model association methods by combining them with an approach similar to step 2 of REGENIE. BayesW handles time-to-event data, whereas BayesRR-RC works for case-control phenotypes. They provide UKBB results for 11 cancers, replicate findings, and assess predictions in the Estonian Biobank.
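
The difference between the two phenotype types mentioned in the summary can be sketched as follows. The record layout and function names are hypothetical and only illustrate the general idea: a case-control phenotype keeps a binary label, while a time-to-event phenotype keeps an age plus an event indicator, with undiagnosed participants right-censored at their last follow-up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Participant:
    age_at_last_followup: float
    age_at_diagnosis: Optional[float]  # None if not diagnosed during follow-up

def case_control_label(p: Participant) -> int:
    """Binary phenotype: 1 for a diagnosed case, 0 for a control."""
    return 1 if p.age_at_diagnosis is not None else 0

def time_to_event(p: Participant) -> Tuple[float, int]:
    """Return (age, event): event=1 at diagnosis, event=0 if right-censored."""
    if p.age_at_diagnosis is not None:
        return (p.age_at_diagnosis, 1)
    # Not diagnosed so far: only known to be disease-free up to last follow-up.
    return (p.age_at_last_followup, 0)

# Made-up cohort: one case and two controls of different ages.
cohort = [
    Participant(age_at_last_followup=70.0, age_at_diagnosis=62.5),
    Participant(age_at_last_followup=45.0, age_at_diagnosis=None),
    Participant(age_at_last_followup=78.0, age_at_diagnosis=None),
]
labels = [case_control_label(p) for p in cohort]
events = [time_to_event(p) for p in cohort]
print(labels)
print(events)
```

In the binary encoding the 45-year-old and 78-year-old controls are indistinguishable, whereas the censored encoding records how long each has been observed disease-free.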

Strengths: Age-of-onset is becoming more and more available, and developing methods that make the best use of this additional information is valuable.

Weaknesses: In this work, there is (for now) limited validation of results and comparison with other existing methods.

Version published to 10.1101/2022.03.25.22272955 on medRxiv
Mar 31, 2022
