Nomograms of human hippocampal volume shifted by polygenic scores

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This manuscript considers whether genetic information can improve the clinical utility of population norms derived from brain imaging data. The authors propose to incorporate polygenic scores into normative models of hippocampal volume to improve predictions of neurodegenerative disease. This approach is elegantly demonstrated in this manuscript and may be useful for clinical translation of population neuroimaging.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their name with the authors.)

Abstract

Nomograms are important clinical tools applied widely in both developing and aging populations. They are generally constructed as normative models identifying cases as outliers to a distribution of healthy controls. Currently used normative models do not account for genetic heterogeneity. Hippocampal volume (HV) is a key endophenotype for many brain disorders. Here, we examine the impact of genetic adjustment on HV nomograms and the translational ability to detect dementia patients. Using imaging data from 35,686 healthy subjects aged 44–82 from the UK Biobank (UKB), we built HV nomograms using Gaussian process regression (GPR), which – compared to a previous method – extended the application age by 20 years, including dementia critical age ranges. Using HV polygenic scores (HV-PGS), we built genetically adjusted nomograms from participants stratified into the top and bottom 30% of HV-PGS. This shifted the nomograms in the expected directions by ~100 mm³ (2.3% of the average HV), which equates to 3 years of normal aging for a person aged ~65. Clinical impact of genetically adjusted nomograms was investigated by comparing 818 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database diagnosed as either cognitively normal (CN), having mild cognitive impairment (MCI) or Alzheimer’s disease (AD) patients. While no significant change in the survival analysis was found for MCI-to-AD conversion, an average of 68% relative decrease was found in intra-diagnostic-group variance, highlighting the importance of genetic adjustment in untangling phenotypic heterogeneity.
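
The size of the reported shift can be unpacked with a little arithmetic. The sketch below uses only the three figures quoted in the abstract (~100 mm³, 2.3%, 3 years); the mean volume and the annual loss it derives are implied approximations, not values reported by the authors.

```python
# Back-of-the-envelope check using only figures quoted in the abstract.
shift_mm3 = 100.0        # approximate nomogram shift (mm^3)
shift_fraction = 0.023   # the shift as a fraction of average HV (2.3%)
equivalent_years = 3.0   # years of normal aging the shift equates to (age ~65)

# Implied average hippocampal volume (derived here, not reported).
implied_mean_hv = shift_mm3 / shift_fraction

# Implied rate of normal volume loss around age 65 (also derived).
implied_loss_per_year = shift_mm3 / equivalent_years

print(f"implied mean HV: {implied_mean_hv:.0f} mm^3")
print(f"implied loss per year: {implied_loss_per_year:.1f} mm^3/year")
```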

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    In this paper, the authors estimate growth curves ('nomograms') for hippocampal volume (HV) using Gaussian process regression applied to UK Biobank data and evaluate the influence of polygenic scores for HV on the estimated centile curves. By taking this into account, the centile scores are shifted up or down accordingly. The authors then apply this to the ADNI cohort and show that subjects with dementia mostly lie in the lower centiles, but this does not improve the prediction of transition from mild cognitive impairment to dementia.

    This paper is reasonably well written and the finding that centile curves for different phenotypes are sensitive to genetic features will be of interest to many in the field, albeit perhaps somewhat unsurprising given the polygenic score evaluated here is for the same phenotype under investigation (i.e. HV). I think using centiles derived from nomograms/normative models for precisely assessing both current staging and progression of neurological disorders is a highly promising direction. Regarding this manuscript, I have a few comments about the methodology and interpretation of results, which I will outline below.

    • My most significant concern is that it appears that the assumption of Gaussian residuals is violated by the HV phenotypes that the authors fit their GP to. For example, in figure 2, the distribution is clearly skewed, and the lower centiles in particular are poorly fit to the data. First, please provide additional metrics to assess the fit and calibration of these models quantitatively (the latter can be done e.g. via Q-Q plots).

    Thanks for pointing this out. We are sorry for causing this confusion. The skew in the figure appears because the scatter plot overlaid on the GP-generated nomogram shows ADNI samples of all diagnoses, not the UKB training data used for the GP. The lower centiles are mainly occupied by the participants with AD or MCI (see the new plots in Figure 5). In addition, the healthy subjects from ADNI do indeed fit the model reasonably well. We have added a supplementary figure to show just the healthy subjects and have made the following edits in the text to address the confusion:

    Lines 143-149: “Nomograms of healthy subjects generated using the SWA and GPR method displayed similar trends (Figure 2; Supplementary Figure S8). … This extension allowed 86% of all diagnostic groups from the ADNI to be evaluated versus 56% in the SWA Nomograms (Figure 2; Figure 2 – Figure Supplement 2).”

    Lines 159-170 (description of figure 2): “Figure 2: Comparing Nomogram Generation Methods. Nomograms produced from healthy UKB subjects using the sliding window approach (SWA) (red lines) and Gaussian process regression (GPR) method (grey lines) … The benefits of this extension can be seen with scatter plots of ADNI subjects of all diagnoses overlaid (E, F… A similar figure with only the Cognitively Normal ADNI subjects can be found in Figure 2 – Figure Supplement 2.”

    Second, I think if the authors wish to make precise inferences about the centile distribution for the reference model, then the deviation from Gaussianity ought to be accommodated in some manner. There are several options for this, including different noise models (e.g. Gamma, inverse Gamma, SHASH, etc), variable transformation, or quantile regression. One option that could be useful in the context of Gaussian process regression is the use of likelihood warping (see e.g. Fraza et al 2021 Neuroimage and references therein) which was originally developed for GP models. I would recommend the authors pursue one of these routes and provide metrics to properly gauge the fit.

    This is an excellent point. However, we believe that given that the training data indeed follows a Gaussian distribution (see new Figure 4 – Figure Supplement 3; reproduced below) across the relevant strata (sex, PGS) and across age groups, such modifications are not required.
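
One simple way to quantify the calibration the reviewer asks about is to compare each nominal centile with the fraction of z-scores that actually fall below it. The sketch below is illustrative only: the z-scores are simulated rather than taken from UKB, and `centile_coverage` is a hypothetical helper, not part of the authors' pipeline.

```python
import random
from statistics import NormalDist

def centile_coverage(z_scores, centiles=(2.5, 5, 25, 50, 75, 95, 97.5)):
    """Return {nominal centile: observed % of z-scores below that centile's cutoff}."""
    std_normal = NormalDist()
    n = len(z_scores)
    coverage = {}
    for c in centiles:
        cutoff = std_normal.inv_cdf(c / 100.0)  # z value of the nominal centile
        below = sum(1 for z in z_scores if z < cutoff)
        coverage[c] = 100.0 * below / n
    return coverage

# Simulated deviations, (observed - predicted mean) / predicted SD.
# These stand in for real model residuals.
rng = random.Random(0)
z_scores = [rng.gauss(0.0, 1.0) for _ in range(20000)]

for nominal, observed in centile_coverage(z_scores).items():
    print(f"nominal {nominal:5.1f}%  observed {observed:5.1f}%")
```

For a well-calibrated Gaussian model the observed percentages track the nominal ones; a skewed residual distribution would show up as systematic gaps in the lower or upper centiles.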

    • Related to the above, it is likely that the selection of subjects with high/low polygenic scores for HV changes the shape of the distribution. It is currently impossible to assess this because no data points are shown in these cases. Please also add this information, along with comparable quantitative metrics to those for the models above.

    Thank you for bringing this up. We have now added a new supplementary figure with the shape of these distributions along with the Shapiro-Wilk test results for each of them. As can be seen, the Shapiro-Wilk test detects mild deviation from Normality in some cases. However, given the size of the strata (N>2000), this is not surprising. Moreover, if multiple-testing correction were applied across the 48 comparisons, none of the tests would be significant at the corrected threshold (P<0.001).
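
The corrected threshold quoted in this response is consistent with a Bonferroni correction. The sketch below assumes a family-wise alpha of 0.05, which the response does not state explicitly; `survives_correction` is a hypothetical helper.

```python
alpha = 0.05   # assumed family-wise error rate (not stated in the response)
n_tests = 48   # number of comparisons mentioned in the response

# Bonferroni: divide alpha by the number of tests.
bonferroni_threshold = alpha / n_tests
print(f"corrected threshold: {bonferroni_threshold:.5f}")

def survives_correction(p_value, threshold=bonferroni_threshold):
    """True if a p-value remains significant after correction."""
    return p_value < threshold
```

Under these assumptions the threshold is roughly 0.001, matching the value given in the response.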

    • How did the authors handle site effects? There appears to be no adjustment for the fact that the ADNI data are acquired from different sites that were not used during the estimation of the normative models. I would expect to see this dealt with properly (e.g. via fixed or random effects included in the modelling) or at the very least a convincing demonstration that site effects are not clearly biasing the results.

    We agree that site effects are a major issue; we have rerun the application experiments after adjusting the ADNI volumes with NeuroCombat. The results did not change significantly, but we have replaced all reported results with the updated ones. In addition, we noted this in the methods section:

    Lines 442-445: Finally, we used NeuroCombat [1] to adjust across ADNI sites and harmonize the volumes with the UKB Dataset. To do this we modelled 58 batches (UKB data as one batch and 57 ADNI sites as separate batches) and added ICV, sex, and diagnosis (assigning all UKB as Healthy and using the diagnosis columns in ADNI) to retain biological variation.
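
The batch and covariate layout described here can be sketched as a data-preparation step. The code below only builds the labels; it does not perform the harmonization itself, and it calls no NeuroCombat function. The record fields (`cohort`, `site`, `diagnosis`, `sex`, `icv`) and the example values are hypothetical.

```python
def harmonization_labels(records):
    """Assign batch and diagnosis labels: UKB is one batch, each ADNI site its own batch."""
    labelled = []
    for rec in records:
        if rec["cohort"] == "UKB":
            batch = "UKB"
            diagnosis = "Healthy"  # all UKB subjects are treated as healthy
        else:
            batch = f"ADNI_site_{rec['site']}"
            diagnosis = rec["diagnosis"]  # taken from the ADNI diagnosis column
        labelled.append({**rec, "batch": batch, "diagnosis": diagnosis})
    return labelled

# Made-up records, for illustration only.
records = [
    {"cohort": "UKB", "site": None, "diagnosis": None, "sex": "F", "icv": 1.45e6},
    {"cohort": "ADNI", "site": "002", "diagnosis": "MCI", "sex": "M", "icv": 1.60e6},
    {"cohort": "ADNI", "site": "011", "diagnosis": "AD", "sex": "F", "icv": 1.38e6},
]
labelled = harmonization_labels(records)
print(sorted({r["batch"] for r in labelled}))
```

With the full data this layout would yield the 58 batches described above, with ICV, sex, and diagnosis kept as covariates so that the adjustment does not remove biological variation.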

    • How do the authors interpret the finding that the relationship between the polygenic scores and HV is different in the cohorts they consider (i.e. bimodal in UKB and unimodal in ADNI)? Does this call into question the appropriateness of the subsampled model for the clinical cohort?

    While we do see a bimodal distribution in UKB, the effect is not very strong, as the other reviewers commented. Therefore, we have de-emphasized this aspect. One reason may be that we detect the slightly bimodal aspect in UKB because of greater statistical power due to the larger sample size (one order of magnitude). A further factor is the SNP data used, i.e., differences in genotyping platform and imputation. This is also the reason why integrating PGS directly into the predictive model comes with additional challenges. We have addressed this topic briefly in our discussion: Lines 390-392: “Lastly, a recent study of PGS uncertainty revealed large variance in PGS estimates [63], which may undermine PGS based stratification; hence a more sophisticated method of building PGS or stratification may improve results further.”

    • Perhaps the authors can comment on (or better, evaluate) how this genetic shift could be accommodated in normative models (e.g. the possibility of including polygenic risk scores as predictor variables in the normative model). This would remove the need for post hoc adjustment and would allow more precise control over the adjustment than just taking the upper/lower xxx % of the PGS distribution as is done in the current manuscript.

    We agree that integrating the genetics directly into the normative models is a great idea, and this is the direction we will explore in future work. However, PGS themselves are prone to show ‘site’ effects that depend on the genotyping method used as well as on the quality of genotyping and imputation. As a consequence, using the ‘raw’ PGS scores in predictive models brings its own challenges. Therefore, we feel that the current framework is simpler at this point and illustrates the potential of PGS when combined with normative models.
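
The stratification used in the current framework (top and bottom 30% of HV-PGS, per the abstract) can be sketched as follows. The data are simulated, and the `pgs` field and `stratify_by_pgs` helper are hypothetical names; in the authors' framework a separate nomogram would then be built from each stratum.

```python
import random
from statistics import quantiles

def stratify_by_pgs(subjects, percent=30):
    """Split subjects into the bottom and top `percent` of the PGS distribution."""
    scores = [s["pgs"] for s in subjects]
    # quantiles(n=100) returns the 99 percentile cut points.
    cuts = quantiles(scores, n=100, method="inclusive")
    low_cut = cuts[percent - 1]          # e.g. 30th percentile
    high_cut = cuts[100 - percent - 1]   # e.g. 70th percentile
    low = [s for s in subjects if s["pgs"] <= low_cut]
    high = [s for s in subjects if s["pgs"] >= high_cut]
    return low, high

# Simulated standardized polygenic scores, for illustration only.
rng = random.Random(1)
subjects = [{"id": i, "pgs": rng.gauss(0.0, 1.0)} for i in range(1000)]

low_pgs, high_pgs = stratify_by_pgs(subjects)
print(len(low_pgs), len(high_pgs))
```

Because the cut points are computed within one cohort, this kind of stratification sidesteps some of the cross-platform comparability issues that raw PGS values would raise as model inputs.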

    • Related to my point above, it is perhaps unsurprising that the polygenic score for the HV phenotype influences the centile distribution. I think the paper would benefit considerably by also evaluating other polygenic scores (e.g., APOE4 as in some of the prior cited references). It would be interesting to compare the magnitude and shape differences for these adjustments. The authors can consider this an optional suggestion.

    Our rationale for focusing on HV PGS was that we sought to improve the accuracy of the normative model. Genetics influences HV, and this is a first attempt to adjust for this in the normative modeling framework. Indeed, APOE-e4 has a sizable effect on HV. However, this is most likely mediated by nascent accelerated neurodegeneration, i.e., Alzheimer’s disease. Thus, in our view, focusing on APOE-e4 would mean focusing on a disease effect. We address this issue briefly in the discussion (Lines 326-334). As a sensitivity analysis, we did indeed test other PGS, such as AD and Whole-Brain-Volume, and found that these do not affect the normative models for HV.

    Reviewer #3 (Public Review):

    Given the large variation in and high heritability of hippocampus volume in the population, taking out known variation in the healthy population is a nice way of reducing heterogeneity, and a step forward towards using normative models in clinical practice. The dataset the nomograms are based on is large enough to do so even when stratified by polygenic scores for hippocampal volume, and these provide interesting information on the role of genetics in hippocampus volume.

    There are however several concerns regarding the applicability of the models to the ADNI dataset. First, the lack of overlap in the age range between the dataset the model is trained on and the application to subjects that are outside that age range is questionable. The authors prefer Gaussian process regression (GPR) over a sliding window-based approach using the argument that the former allows for predictions in a larger age range but extrapolation beyond the reach of the data is usually not valid. The claim that Supplementary Figure 6 shows accurate extension beyond these limits is in my opinion not justified. If anything, we can be rather certain that the extensive growth of the hippocampus up to age 48 is not realistic (see e.g. Dima et al., 2022).

    As mentioned already in response to reviewer #1, this was a miscommunication on our side. We only used the ADNI samples that were within the age range of the models they were being plotted against. The GPR model did not require smoothing at the edges of the age-range and thus can support a wider age range than the SWA. This is why we stated that the extension of the nomograms enabled more of the ADNI dataset to be used, i.e., because otherwise these samples were outside the range of the model and could not be used.

    We have changed the following lines in the manuscript to make this idea explicit:

    Lines 477-478 (end of GPR methods section): “For both SWM and GPR models, we only tested the ADNI samples that lay within the age range of each model respectively.”
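
The restriction described in this edit amounts to an age filter applied per model. A minimal sketch follows, with hypothetical sample ages and model limits (the actual SWA and GPR limits are not restated here); `within_model_range` is an illustrative helper.

```python
def within_model_range(samples, age_min, age_max):
    """Keep only samples whose age lies inside the model's supported age range."""
    return [s for s in samples if age_min <= s["age"] <= age_max]

# Hypothetical ages and model limits, for illustration only.
adni_samples = [
    {"id": "a", "age": 58.0},
    {"id": "b", "age": 71.5},
    {"id": "c", "age": 89.0},
]
narrow_model = within_model_range(adni_samples, age_min=60, age_max=75)
wide_model = within_model_range(adni_samples, age_min=50, age_max=80)
print(len(narrow_model), len(wide_model))
```

A model with a wider supported range simply retains more of the clinical cohort, which is the sense in which the GPR nomograms allowed more of the ADNI samples to be evaluated.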

    Regarding the accurate extension claim, we have edited the line (411-412) in the discussion so that it now reads:

    Lines 347-348 “In fact, our GPR model can potentially be extended a few years beyond those limits”

    Thank you for pointing out the discrepancy in the hippocampal growth around 48 with the results by Dima et al. 2022. Although the sample sizes of the two studies are similar, the data availability in UKB for ages 45-50 is rather sparse (N<100; see new Figure 4 – Figure Supplement 3). Thus, the observed growth is likely due to under-sampling. The growth effect has been observed in other studies using UKB data [7,8]. We have noted this in the discussion:

    Lines 354-356: “However, there is a possibility that our results suffer from edge effects. For example, we suspect that the peak noted in the male nomogram is likely due to under-sampling in the younger participants.”

    Second, the drop in mean 'percentile' difference between high and low polygenic scoring individuals that results if one uses genetically adjusted nomograms seems nice, but this difference is currently just a number and the reader cannot see whether this difference is significant, or clinically relevant.

    We have now provided a new figure (Figure 5) that shows the boxplots behind those numbers. The MCI-to-AD conversion analyses in the ADNI explored the clinical benefit of genetically adjusted nomograms. However, adjusted and unadjusted percentiles performed equally well. In the discussion we argue that the MCI stage is already too late and earlier stages may benefit from the increased precision:

    Lines 373-378: “However, despite this sizable effect, genetically adjusted nomograms did not provide additional insight into distinguishing MCI subjects that remained stable or converted to AD. Nonetheless, the added precision may prove more useful in early detection of deviation among CN subjects, for instance in detecting subtle hippocampal volume loss in individuals with presymptomatic neurodegeneration.”
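
To make the adjusted versus unadjusted percentiles concrete, the sketch below scores one volume against a Gaussian nomogram with and without a PGS-dependent shift of the mean. Only the ~100 mm³ shift comes from the abstract; the volume, mean, and standard deviation are made-up values, and `nomogram_percentile` is a hypothetical helper.

```python
from statistics import NormalDist

def nomogram_percentile(hv, predicted_mean, predicted_sd):
    """Percentile of a volume under a Gaussian nomogram at a given age."""
    return 100.0 * NormalDist(predicted_mean, predicted_sd).cdf(hv)

hv = 4200.0           # hypothetical subject volume (mm^3)
mean_at_age = 4300.0  # hypothetical unadjusted nomogram mean at the subject's age
sd_at_age = 400.0     # hypothetical nomogram standard deviation
pgs_shift = 100.0     # approximate shift for the high-PGS stratum (from the abstract)

unadjusted = nomogram_percentile(hv, mean_at_age, sd_at_age)
adjusted_high_pgs = nomogram_percentile(hv, mean_at_age + pgs_shift, sd_at_age)

print(f"unadjusted: {unadjusted:.1f}th percentile")
print(f"adjusted (high PGS): {adjusted_high_pgs:.1f}th percentile")
```

In this toy case a subject with a high polygenic score drops to a lower percentile once the expected volume is raised, which is the kind of added precision the authors suggest may matter most before the MCI stage.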

  2. Evaluation Summary:

    This manuscript considers whether genetic information can improve the clinical utility of population norms derived from brain imaging data. The authors propose to incorporate polygenic scores into normative models of hippocampal volume to improve predictions of neurodegenerative disease. This approach is elegantly demonstrated in this manuscript and may be useful for clinical translation of population neuroimaging.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their name with the authors.)

  3. Reviewer #3 (Public Review):

    Given the large variation in and high heritability of hippocampus volume in the population, taking out known variation in the healthy population is a nice way of reducing heterogeneity, and a step forward towards using normative models in clinical practice. The dataset the nomograms are based on is large enough to do so even when stratified by polygenic scores for hippocampal volume, and these provide interesting information on the role of genetics in hippocampus volume.

    There are however several concerns regarding the applicability of the models to the ADNI dataset. First, the lack of overlap in the age range between the dataset the model is trained on and the application to subjects that are outside that age range is questionable. The authors prefer Gaussian process regression (GPR) over a sliding window-based approach using the argument that the former allows for predictions in a larger age range but extrapolation beyond the reach of the data is usually not valid. The claim that Supplementary Figure 6 shows accurate extension beyond these limits is in my opinion not justified. If anything, we can be rather certain that the extensive growth of the hippocampus up to age 48 is not realistic (see e.g. Dima et al., 2022). Second, the drop in mean 'percentile' difference between high and low polygenic scoring individuals that results if one uses genetically adjusted nomograms seems nice, but this difference is currently just a number and the reader cannot see whether this difference is significant, or clinically relevant.

  4. Reviewer #2 (Public Review):

    There is much to be commended about the goal of integrating genetic risk into normative model estimates in order to improve diagnostic utility. Choosing a well-validated biomarker and genetic risk profile and a well-powered combination of datasets are further strengths of the present work. The authors chose a well-validated and robust analytical approach for generating normative models in Gaussian Process Regression. My general assessment is that this is a solid piece of scientific research that took a specific hypothesis and evaluated it well. More broadly they provide an interesting model for how one can integrate genetics into a normative modelling imaging framework.

    As this paper could potentially serve such a role-model function, there may be some elements of the methodology and the results that could be further expanded upon in the main manuscript. Some extended evaluation of potential technical sources of variation could be included, but principally I think the integration of weights into GPR directly could be discussed more in-depth alongside an evaluation of the scenarios in which this may be appropriate. The authors could also speculate on whether a similar methodology is applicable in other contexts or for other combinations of data types.

  5. Reviewer #1 (Public Review):

    In this paper, the authors estimate growth curves ('nomograms') for hippocampal volume (HV) using Gaussian process regression applied to UK Biobank data and evaluate the influence of polygenic scores for HV on the estimated centile curves. By taking this into account, the centile scores are shifted up or down accordingly. The authors then apply this to the ADNI cohort and show that subjects with dementia mostly lie in the lower centiles, but this does not improve the prediction of transition from mild cognitive impairment to dementia.

    This paper is reasonably well written and the finding that centile curves for different phenotypes are sensitive to genetic features will be of interest to many in the field, albeit perhaps somewhat unsurprising given the polygenic score evaluated here is for the same phenotype under investigation (i.e. HV). I think using centiles derived from nomograms/normative models for precisely assessing both current staging and progression of neurological disorders is a highly promising direction. Regarding this manuscript, I have a few comments about the methodology and interpretation of results, which I will outline below.

    - My most significant concern is that it appears that the assumption of Gaussian residuals is violated by the HV phenotypes that the authors fit their GP to. For example, in figure 2, the distribution is clearly skewed and the lower centiles in particular are poorly fit to the data. First, please provide additional metrics to assess the fit and calibration of these models quantitatively (the latter can be done e.g. via Q-Q plots). Second, I think if the authors wish to make precise inferences about the centile distribution for the reference model, then the deviation from Gaussianity ought to be accommodated in some manner. There are several options for this, including different noise models (e.g. Gamma, inverse Gamma, SHASH, etc), variable transformation, or quantile regression. One option that could be useful in the context of Gaussian process regression is the use of likelihood warping (see e.g. Fraza et al 2021 Neuroimage and references therein) which was originally developed for GP models. I would recommend the authors pursue one of these routes and provide metrics to properly gauge the fit.

    - Related to the above, it is likely that the selection of subjects with high/low polygenic scores for HV changes the shape of the distribution. It is currently impossible to assess this because no data points are shown in these cases. Please also add this information, along with comparable quantitative metrics to those for the models above.

    - How did the authors handle site effects? There appears to be no adjustment for the fact that the ADNI data are acquired from different sites that were not used during the estimation of the normative models. I would expect to see this dealt with properly (e.g. via fixed or random effects included in the modelling) or at the very least a convincing demonstration that site effects are not clearly biasing the results.

    - How do the authors interpret the finding that the relationship between the polygenic scores and HV is different in the cohorts they consider (i.e. bimodal in UKB and unimodal in ADNI)? Does this call into question the appropriateness of the subsampled model for the clinical cohort?

    - Perhaps the authors can comment on (or better, evaluate) how this genetic shift could be accommodated in normative models (e.g. the possibility of including polygenic risk scores as predictor variables in the normative model). This would remove the need for post hoc adjustment and would allow more precise control over the adjustment than just taking the upper/lower xxx % of the PGS distribution as is done in the current manuscript.

    - Related to my point above, it is perhaps unsurprising that the polygenic score for the HV phenotype influences the centile distribution. I think the paper would benefit considerably by also evaluating other polygenic scores (e.g. APOE4 as in some of the prior cited references). It would be interesting to compare the magnitude and shape differences for these adjustments. The authors can consider this an optional suggestion.