Charting brain growth and aging at high spatial precision

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This manuscript is of broad interest to the neuroimaging community. It establishes a detailed reference model of human brain development and lifespan trajectories based on a very large data set, across many cortical and subcortical brain regions. The model not only explains substantial variability in test data but also successfully uncovers individual differences in a database of psychiatric patients that, in addition to group-level analyses, may be critical for diagnosis, thereby demonstrating high clinical potential. It presents a clear overview of the data resource, including detailed evaluation metrics, and makes code, models and documentation directly available to the community.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1, Reviewer #2 and Reviewer #3 agreed to share their name with the authors.)


Abstract

Defining reference models for population variation, and the ability to study individual deviations from them, is essential for understanding inter-individual variability and its relation to the onset and progression of medical conditions. In this work, we assembled a reference cohort of neuroimaging data from 82 sites (N=58,836; ages 2–100) and used normative modeling to characterize lifespan trajectories of cortical thickness and subcortical volume. Models are validated against a manually quality-checked subset (N=24,354) and we provide an interface for transferring to new data sources. We showcase the clinical value by applying the models to a transdiagnostic psychiatric sample (N=1,985), showing they can be used to quantify variability underlying multiple disorders whilst also refining case-control inferences. These models will be augmented with additional samples and imaging modalities as they become available. This provides a common reference platform to bind results from different studies and ultimately paves the way for personalized clinical decision-making.
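As a toy illustration of the normative-modelling idea (a deliberately simplified stand-in for the modelling framework used in the study, not the authors' method), the sketch below fits an age-conditional mean and spread for one simulated imaging phenotype in a reference sample and scores a new measurement as a deviation (z) score:

```python
# Toy illustration of the normative-modelling idea only -- NOT the modelling
# framework used in the paper. Fit an age-conditional mean and spread for one
# imaging phenotype in a reference sample, then express a new individual's
# measurement as a deviation (z) score. All data here are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(2, 100, size=5000)                           # simulated reference ages
thickness = 3.0 - 0.01 * age + rng.normal(0, 0.12, age.size)   # simulated cortical thickness

X = np.column_stack([age, age**2, age**3])                     # simple cubic age model
mean_model = LinearRegression().fit(X, thickness)
sigma = (thickness - mean_model.predict(X)).std()              # homoscedastic spread, for simplicity

def deviation_score(new_age: float, new_value: float) -> float:
    """z-score of a new measurement relative to the age-conditional reference."""
    x = np.array([[new_age, new_age**2, new_age**3]])
    return (new_value - mean_model.predict(x)[0]) / sigma

print(deviation_score(65.0, 2.4))                              # e.g. deviation for a 65-year-old
```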

Article activity feed

  1. Author Response:

    Reviewer #1 (Public Review):

    The underlying data are dominated by data from the UK Biobank, which means that, in effect, relatively few samples are available for the 25-50 age group. This may not be a big issue in terms of estimating smooth trajectories, but may limit comparisons to the reference model in certain cases (e.g. early disease onset) where this age range may be of particular interest.

    We now show per-site evaluation metrics, cross-validation results, and additional transfer examples. These additional analyses show that the model performance is not driven solely by the UKB sample. However, we agree with this comment and have also updated the limitations section (in the Discussion) to address the overrepresentation of UKB, and included a statement regarding its known sampling bias.

    The manual QC data is somewhat limited, as it is based on a predominantly younger cohort (mean age ~30 years). Furthermore, the number of outcome measures (cortical thickness and subcortical volume) and the number of data modalities (only structural MRI) are limited. However, as the authors also state, these limitations can hopefully be addressed by incorporating new/additional data sets into the reference models as they become available.

    We have added further details regarding the quality checking procedure to the methods section and improved the instructions for running the scripts on the QC GitHub page, including an interactive link to view an example of the manual QC environment, to enable others to reproduce our manual QC pipeline.

    Reviewer #2 (Public Review):

    1. The evidence that the model will generalize ("transfer" as per the authors) to new, unseen sites, is very limited. To robustly support the claim that the model generalizes to data from new sites, a cross-validation evaluation with a "leave-one-site-out" (or leave-K-sites-out) folding strategy seems unavoidable, so that at each cross-validation split completely unseen sites are tested (for further justification of this assertion, please refer to Esteban et al. (2017)). The "transferability" of the model is left very weakly supported by Figures 3 and 4, whose interpretation is very unclear. This point is further developed below, regarding the overrepresentation of the UK Biobank dataset.

    We thank the reviewers for this suggestion and have addressed the concern regarding generalizability in several ways. First, we ran an additional 10 randomized train/test splits of the full sample. These new analyses show the stability of our models, as there is very little variation in the evaluation metrics across all 10 splits. The results are visualized in Figure 3 – Supplement 2. However, this static figure is challenging to read simply because many brain regions are fit into a single plot. We therefore also created an online interactive visualization tool that shows the image of the brain region and the explained variance when hovering over a point. This interactive visualization was created for all supplemental tables for easier exploration and interpretation, and we now recommend it as the primary way to interrogate our findings.

    Second, we updated and expanded the transfer dataset to include 6 open datasets from OpenNeuro.org (N=546), and we provide this example dataset on our GitHub together with the transfer code. This simultaneously provides a more comprehensive evaluation of the performance of our model on unseen data and a more comprehensive walk-through for new users applying our models to new data (i.e., sites unseen in training).

    Finally, we added per-site evaluation metrics (Figure 3 – Supplement 3) to demonstrate that performance is relatively stable across sites and not driven by a single large site (i.e., UKB). As site is strongly correlated with age, these visualizations can also be used to approximate model performance in different age ranges (e.g., 9–10-year-old performance can be assessed from the ABCD site metrics, and 50–80-year-old performance from the UKB metrics). Moreover, we would like to emphasize that not all sites should be expected to achieve the same performance, because the sampling of the different sites is highly heterogeneous: some sites cover a broad age range (e.g., OASIS, UKB) whereas others cover an extremely narrow age range (e.g., ABCD).
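    A minimal sketch of this kind of per-site breakdown is shown below (toy data and placeholder column names, not the study's actual files):

    ```python
    # Sketch of a per-site evaluation in the spirit of Figure 3 - Supplement 3:
    # explained variance of the predictions computed separately within each site.
    # The toy data and column names are placeholders, not the study's actual files.
    import numpy as np
    import pandas as pd
    from sklearn.metrics import explained_variance_score

    rng = np.random.default_rng(0)
    test = pd.DataFrame({
        "site": rng.choice(["UKB", "ABCD", "OASIS"], size=300),
        "y_true": rng.normal(size=300),
    })
    test["y_pred"] = test["y_true"] * 0.8 + rng.normal(scale=0.5, size=300)  # toy predictions

    per_site_ev = (
        test.groupby("site")
            .apply(lambda g: explained_variance_score(g["y_true"], g["y_pred"]))
            .sort_values()
    )
    print(per_site_ev)   # low values would flag sites driving (or dragging down) performance
    ```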

    2. If I understand the corresponding tables correctly, it seems that UK Biobank data account for roughly half of the whole dataset. If the cross-validation approach is not considered, at the very (very) least, more granular analyses of the evaluation on the test set should be provided, for example, plotting the distribution of prediction accuracy per site, to spot whether the model is just overfitted to the UKB sample. For instance, in Figure 4 it would be easy to split row 2 into UKB and "other" sites to ensure both look the same.

    We have addressed this comment in response to Reviewer 1 above.

    3. Beyond the outstanding work of visually assessing thousands of images, the Quality Control areas of the manuscript should be better executed, particularly lines 212-233: 3.a. The overall role of the mQC dataset is unclear. QC implies a destructive process in which subpar examples of a given dataset (or a product) are excluded and dropped out of the pipeline, but that doesn't seem to be the case for the mQC subset, which seems to be a dataset with binary annotations of the quality of the FreeSurfer outcomes and the image.

    We have addressed this in response to Reviewer 1 above. We included the manual QC in this work because, in prior work by our group (https://www.biorxiv.org/content/10.1101/2021.05.28.446120v1.abstract) that leveraged big data and relied on automated QC, reviewers often criticized this approach and claimed our results could be driven by poor-quality data. Thus, in this work we wanted to compare the evaluation metrics of the large, automatically QC'd dataset with those of the manually QC'd dataset, which show very similar performance.

    3.b The visual assessment protocol is insufficiently described for any attempt to reproduce it: (i) numbers of images rated by author SR and reused from the ABCD's accept/reject ratings; (ii) of those rated by author SR, state how the images were selected (randomly, stratified, etc.) and whether site-provenance, age, etc. were blinded to the rater; (iii) protocol details such as whether the rater navigated through slices, whether that was programmatic or decided per-case by the rater, average time eyeballing an image, etc.; (iv) rating range (i.e., accept/reject) and assignment criteria; (v) quality assurance decisions (i.e., how the quality annotations are further used)

    These details have been added to the methods section where we describe the manual QC process. We have also updated the QC GitHub with more detailed usage instructions and included a link to view an example of the manual QC environment.

    3.c Similarly, the integration within the model and/or the training/testing of the automated QC is unclear.

    The responses to Reviewer 1 above and our revisions to the methods section should also clarify this. In brief, QC was performed before the data were split into the train and test sets used to assess generalizability.

    Additional comments

    • Repeated individuals: it seems likely that there are repeated individuals, at least within the UKB and perhaps ABCD. This could be more clearly stated, indicating whether this is something that was considered or, conversely, that shouldn't influence the analysis.

    We have clarified in the methods section that no repeated subjects were used in the dataset.

    • Figure 3 - the Y-axis of each column should have a constant range to allow the suggested direct comparison.

    We have changed Figure 3 to have a constant range across all test sets.

    • Tables 5 through 8 are hard to parse - They may be moved to CSV files available somewhere under a CC-BY or similarly open license, and better interpreted with figures that highlight the message distilled from these results.

    We agree with the reviewer about the difficulty in summarizing such a large number of results in an easily digestible manner and that tables are not the optimal format to achieve this. Therefore, we have created interactive visualizations for Tables 5-8 that make exploring the evaluation metrics much easier. All the CSV files are also hosted on our GitHub page in the metrics folder (https://github.com/predictive-clinical-neuroscience/braincharts/tree/master/metrics).
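    A minimal sketch of building such an interactive, hover-enabled view from a metrics table is given below (toy data and placeholder column names; in practice one would load a CSV from the metrics folder linked above):

    ```python
    # Minimal sketch of an interactive view of per-region metrics, with the region
    # name shown on hover. The toy data and column names are placeholders, not the
    # actual files in the braincharts "metrics" folder.
    import pandas as pd
    import plotly.express as px

    metrics = pd.DataFrame({
        "region": ["lh_bankssts", "lh_cuneus", "rh_insula"],   # placeholder ROI names
        "explained_variance": [0.41, 0.35, 0.52],              # placeholder test-set values
    })
    fig = px.scatter(
        metrics,
        x="region",
        y="explained_variance",
        hover_name="region",
        title="Test-set explained variance per region (toy data)",
    )
    fig.show()
    ```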

    • Lines 212-214 about the QA/QC problem in neuroimaging are susceptible to misinterpretation. That particular sentence tries to bridge between the dataset description and the justification for the mQC sample and corresponding experiments. However, it fails in that objective (as I noted as a weakness, the connection between the model and QC is unclear), and also misrepresents the why and how of QC overall.

    We have considerably expanded upon our motivation for using a manual QC approach and the steps this entails, which should address this issue.

    • The fact that the code or data are accessible doesn't mean they are usable. Indeed, the lack of license on two of the linked repositories effectively pre-empts reuse. Please state a license on them.

    We thank the reviewer for this suggestion. We have updated both repositories to include a license file.

    • Figure 1 - caption mentions a panel E) that seems missing in the figure.

    We have corrected this mistake in the caption of Figure 1.

    • There is no comment on the adaptations taken to execute FreeSurfer on the first age range of the sample (2-7 yo.).

    We did not make adaptations of the FreeSurfer pipeline for this age range and have added this to the limitations section.

    • Following up on weakness 3.c, while scaling and centering is a sensible thing to do, it's likely that those pruned outliers actually account for much of the information under investigation. Meaning, EC is a good proxy for manual rating - but Rosen et al. demonstrate this on human, neurotypical, adult brains. Therefore, general application must be handled with care. For example, elderly and young populations will, on average, show substantially more images with excessive motion. These images will go through FreeSurfer, and often produce an outlier EC, while a few will yield a perfectly standard EC. Typically, these cases with standard ECs are probably less accurate on the IDPs being analyzed, for example, if prior knowledge biased the output more for the hidden properties of this subject. In other words, in these cases, a researcher would be better off actually including the outliers.

    This is an important point to raise. We agree with the reviewer that the Euler characteristic (EC) is likely correlated with pathology in addition to data quality (e.g., due to movement artefacts), and that this is important to consider when modeling clinical populations while also ensuring high-quality data. First, we point out that the inclusion threshold is mostly important for the estimation of the normative model, which in our work – like Rosen et al. – is based on healthy control data. It is easy to repeat predictions for subsequent clinical samples using a more lenient inclusion threshold (or none at all) in cases where this consideration might be operative. Second, in an effort to strike the right balance, we chose the EC threshold conservatively, so that it excludes only subjects that are very far into the tail of the (rescaled and centered) EC distribution. This means that we are likely dropping only subjects with true topological defects. This is also an important motivation for the careful manual QC procedures we describe above. That said, any heuristic is necessarily imperfect, which we acknowledge in the limitations section and in the methods.
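    A minimal sketch of an EC-based inclusion rule in this spirit is shown below (toy data; the threshold, file and column names are placeholders, not the values used in the study):

    ```python
    # Illustrative sketch of an Euler characteristic (EC) inclusion rule: centre the EC
    # per site, rescale it by a robust spread, and drop only scans far into the tail.
    # The toy data, threshold and column names are placeholders.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "site": rng.choice(["siteA", "siteB"], size=1000),
        "euler_number": rng.normal(-40, 15, size=1000),        # toy FreeSurfer Euler numbers
    })

    # Centre per site (sites differ systematically in EC), then rescale.
    centred = df["euler_number"] - df.groupby("site")["euler_number"].transform("median")
    scaled = centred / centred.abs().median()

    THRESHOLD = -10                                            # placeholder conservative cut-off
    keep = scaled > THRESHOLD                                  # more negative EC = more defects
    print(f"excluding {(~keep).sum()} of {len(df)} scans far into the EC tail")
    df_qc = df[keep]
    ```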

    • Title: "high precision" - it is unclear what precision this is qualifying as high. Is it effectively spatial granularity for a large number of ROIs being modeled, or is it because the spread of the normative charts is narrow along the lifespan as compared to some standard of variability?

    We refer to spatial precision in terms of the granularity of the regions of interest that we estimate models for. We have revised the manuscript throughout to make this more explicit.

  2. Evaluation Summary:

    This manuscript is of broad interest to the neuroimaging community. It establishes a detailed reference model of human brain development and lifespan trajectories based on a very large data set, across many cortical and subcortical brain regions. The model not only explains substantial variability in test data but also successfully uncovers individual differences in a database of psychiatric patients that, in addition to group-level analyses, may be critical for diagnosis, thereby demonstrating high clinical potential. It presents a clear overview of the data resource, including detailed evaluation metrics, and makes code, models and documentation directly available to the community.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1, Reviewer #2 and Reviewer #3 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    This work provides a set of reference models based on a large population sample that combines neuroimaging data from over 80 different scan sites. Collective modelling of these data results in representative trajectories of brain development and the effects of ageing covering the full human lifespan. The models are not restricted to a certain clinical condition and model outputs can be used to assess variability on the subject level as well as differences between groups.

    It provides a set of more detailed models than other comparable approaches, modelling over 180 distinct brain regions and thus allowing investigation of patterns of spatial variability in individuals and across different clinical conditions.

    The conclusions of this paper are well supported and the mathematical details of the employed methodology have been presented elsewhere. The authors provide well-documented and easy-to-use software tools that can be used for further, independent validation.

    If maintained and updated regularly, this has the potential to become a very valuable resource for a wide range of future studies.

    Strengths:

    Including (at present) over 50k structural MRI scans from 82 different scan sites, the data cover, essentially, the full human lifespan. Careful automated and manual quality control for a substantial part of the data should further rule out any strong effects due to artefacts.

    Instead of whole-brain summaries, the models applied in this work consider a wide range of anatomically distinct brain regions and clinically relevant outcome measures. In this sense, this is the largest resource of its kind providing lifespan reference trajectories and allowing for both individual and group-based predictions.

    The accompanying code and documentation are truly accessible through comprehensive tutorial pages and example scripts that can easily be executed locally or in a browser.

    Weaknesses:

    The assessment of model outputs is mainly focused on train-test splits stratified by site. Results from a transfer experiment to a new site are presented briefly, albeit with little detail. To better understand (and ideally quantify) scanner and other bias or confounding effects, a more comprehensive analysis would be needed. For the community to use this as a reference, it should be clear how generalisable it is to new data. One way to do this would be to train on one set of sites and test on another (disjoint) set of sites.
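    For concreteness, a disjoint-sites split of the kind suggested here can be obtained with a grouped splitter such as scikit-learn's GroupShuffleSplit, using site as the grouping variable (a generic sketch with toy data, not the authors' pipeline):

    ```python
    # Generic sketch of a disjoint-sites evaluation using scikit-learn's GroupShuffleSplit,
    # with "site" as the grouping variable so that no test site appears in training.
    # The toy data and column names are placeholders.
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    rng = np.random.default_rng(0)
    data = pd.DataFrame({
        "site": rng.choice([f"site{i}" for i in range(20)], size=2000),
        "age": rng.uniform(2, 100, size=2000),
    })

    splitter = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for train_idx, test_idx in splitter.split(data, groups=data["site"]):
        train_sites = set(data["site"].iloc[train_idx])
        test_sites = set(data["site"].iloc[test_idx])
        assert train_sites.isdisjoint(test_sites)   # unseen sites at every split
        # ...fit the model on data.iloc[train_idx], evaluate on data.iloc[test_idx]
    ```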

    The underlying data are dominated by data from the UK Biobank, which means that, in effect, relatively few samples are available for the 25-50 age group. This may not be a big issue in terms of estimating smooth trajectories, but may limit comparisons to the reference model in certain cases (e.g. early disease onset) where this age range may be of particular interest.

    The manual QC data is somewhat limited, as it is based on a predominantly younger cohort (mean age ~30 years). Furthermore, the number of outcome measures (cortical thickness and subcortical volume) and the number of data modalities (only structural MRI) are limited. However, as the authors also state, these limitations can hopefully be addressed by incorporating new/additional data sets into the reference models as they become available.

  4. Reviewer #2 (Public Review):

    This manuscript presents a herculean effort in extracting and modeling Imaging Derived Phenotypes (IDPs; Alfaro-Almagro, 2018) from anatomical MRI data of the human brain from a large sample of 58,836 images corresponding to individuals of ages 2-100, compiled from a total of 82 previously existing datasets. The objective of this modeling was to define standard (authors refer to this as "normative" and add a disclaimer in the discussion that the term should be avoided) distributions of a set of target IDPs across the lifespan. The manuscript shows the clinical utility of the model on a transdiagnostic psychiatric sample (N=1,985), in that "individual-level deviations provide complementary information to the group effects." The full extent of this application is however left for further work based on this paper. Finally, in the discussion, it is highlighted that the model "can easily be transferred to new sites," which indeed is a fundamental aspect, as the model should generalize to data coming from new (unseen by the model) acquisition sites - a major culprit of almost every current MRI study.

    ## Strengths
    1. The problem is important, and the establishment of these standard distributions of IDPs across time is critical to better describe the healthy brain development trajectories while providing clinicians with a powerful differential tool to aid diagnosis of atypical and diseased brains.
    2. The proposed analysis entails a massive feature extraction approach that, albeit based on the widely-used FreeSurfer software (i.e., not implemented by the authors themselves), requires very careful tracking of the computational execution of neuroimaging workflows and their outcomes.
    3. The choice of model seems reasonable and it is effectively shown that it indeed resolves the problem at hand.
    4. The manual quality control (QC) by author SR also deserves recognition, visually assessing and (I'm assuming as nothing is said otherwise) manually bookkeeping the QC annotations of thousands of subjects.
    5. Prior work noted by the Editors is referenced, and the results are discussed in relation to that prior work.
    6. The manuscript is well-written - the organization, accessibility, length, clarity, and flow are overall very adequate.

    ## Weaknesses
    1. The evidence that the model will generalize ("transfer" as per the authors) to new, unseen sites, is very limited. To robustly support the claim that the model generalizes to data from new sites, a cross-validation evaluation with a "leave-one-site-out" (or leave-K-sites-out) folding strategy seems unavoidable, so that at each cross-validation split completely unseen sites are tested (for further justification of this assertion, please refer to Esteban et al. (2017)). The "transferability" of the model is left very weakly supported by Figures 3 and 4, whose interpretation is very unclear. This point is further developed below, regarding the over-representation of the UK Biobank dataset.
    2. If I understand the corresponding tables correctly, it seems that UK Biobank data account for roughly half of the whole dataset. If the cross-validation approach is not considered, at the very (very) least, more granular analyses of the evaluation on the test set should be provided, for example, plotting the distribution of prediction accuracy per site, to spot whether the model is just overfitted to the UKB sample. For instance, in Figure 4 it would be easy to split row 2 into UKB and "other" sites to ensure both look the same.
    3. Beyond the outstanding work of visually assessing thousands of images, the Quality Control areas of the manuscript should be better executed, particularly lines 212-233:
    3.a. The overall role of the mQC dataset is unclear. QC implies a destructive process in which subpar examples of a given dataset (or a product) are excluded and dropped out of the pipeline, but that doesn't seem to be the case for the mQC subset, which seems to be a dataset with binary annotations of the quality of the FreeSurfer outcomes and the image.
    3.b The visual assessment protocol is insufficiently described for any attempt to reproduce it: (i) numbers of images rated by author SR and reused from the ABCD's accept/reject ratings; (ii) of those rated by author SR, state how the images were selected (randomly, stratified, etc.) and whether site-provenance, age, etc. were blinded to the rater; (iii) protocol details such as whether the rater navigated through slices, whether that was programmatic or decided per-case by the rater, average time eyeballing an image, etc.; (iv) rating range (i.e., accept/reject) and assignment criteria; (v) quality assurance decisions (i.e., how the quality annotations are further used)
    3.c Similarly, the integration within the model and/or the training/testing of the automated QC is unclear.

    ## Additional comments
    - Repeated individuals: it seems likely that there are repeated individuals, at least within the UKB and perhaps ABCD. This could be more clearly stated, indicating whether this is something that was considered or, conversely, that shouldn't influence the analysis.
    - Figure 3 - the Y-axis of each column should have a constant range to allow the suggested direct comparison.
    - Tables 5 through 8 are hard to parse - They may be moved to CSV files available somewhere under a CC-BY or similarly open license, and better interpreted with figures that highlight the message distilled from these results.
    - Lines 212-214 about the QA/QC problem in neuroimaging are susceptible to misinterpretation. That particular sentence tries to bridge between the dataset description and the justification for the mQC sample and corresponding experiments. However, it fails in that objective (as I noted as a weakness, the connection between the model and QC is unclear), and also misrepresents the why and how of QC overall.
    - The fact that the code or data are accessible doesn't mean they are usable. Indeed, the lack of license on two of the linked repositories (https://github.com/predictive-clinical-neuroscience/braincharts and https://github.com/saigerutherford/lifespan_qc_scripts) effectively preempts reuse. Please state a license on them. We provide some food for thought about how to choose a license, and why we set the licenses we use in our projects, here: https://www.nipreps.org/community/licensing/.
    - Figure 1 - caption mentions a panel E) that seems missing in the figure.
    - There is no comment on the adaptations taken to execute FreeSurfer on the first age range of the sample (2-7 yo.).
    - Following up on weakness 3.c, while scaling and centering is a sensible thing to do, it's likely that those pruned outliers actually account for much of the information under investigation. Meaning, EC is a good proxy for manual rating - but Rosen et al. demonstrate this on human, neurotypical, adult brains. Therefore, general application must be handled with care. For example, elderly and young populations will, on average, show substantially more images with excessive motion. These images will go through FreeSurfer, and often produce an outlier EC, while a few will yield a perfectly standard EC. Typically, these cases with standard ECs are probably less accurate on the IDPs being analyzed, for example, if prior knowledge biased the output more for the hidden properties of this subject. In other words, in these cases, a researcher would be better off actually including the outliers.
    - Title: "high precision" - it is unclear what precision this is qualifying as high. Is it spatial granularity for a large number of ROIs being modeled, or is it because the spread of the normative charts is narrow along the lifespan as compared to some standard of variability?

  5. Reviewer #3 (Public Review):

    This work develops reference cohort models of lifespan trajectories of cortical thickness and subcortical volume from an impressively large sample of 58k subjects. Providing reference models as the authors do here is a significant advantage for the field. Providing sex-specific models is important. This work is focused on regional cortical thickness and subcortical volumes, which is excellent. The data shown in Figure 2 are quite impressive and, in my mind, somewhat unexpected. Significant differences are apparent for each of the patient categories ADHD, ASD, EP, etc. (Fig. 2c), and the results indicate clear differences as a function of category. This work forms an excellent basis for further work on differences between and within these populations. Overall, this is the type of work we need to see coming out of these very large datasets.