Toward Robust Neuroanatomical Normative Models: Influence of Sample Size and Covariates Distributions

Curation statements for this article:
  • Curated by eLife



Abstract

Normative models are increasingly used to characterize individual-level brain deviations in neuroimaging studies, but their performance depends heavily on the reference sample used for training or adaptation. In this study, we systematically investigated how sample size and covariate composition of the reference cohort influence model fit, deviation estimates, and clinical readouts in Alzheimer’s disease (AD). Using a discovery dataset (OASIS-3, n = 1032), we trained models on healthy control (HC) subsamples ranging from 5 to 600 individuals, while varying age and sex distributions to simulate biases in reference populations. We further assessed the use of adaptive transfer learning by pre-training models on the UK Biobank (n = 42,747) and adapting them to the clinical dataset while applying the same sub-sampling strategies. We evaluated model performance on a fixed HC test set and quantified deviation score errors, outlier detection, and classification accuracy in both the HC test set and the AD cohort. The findings were replicated in an external validation sample (AIBL, n = 463). Across all settings, model performance improved with increasing sample size, but demographic alignment of the covariates, particularly age, was essential for reliable deviation estimates. Models trained directly within the dataset achieved stable fit with approximately 200 HCs, while adapted models reached comparable performance with as few as 50 individuals when pre-trained on large-scale data. These results show that robust individual-level modeling can be achieved using moderately sized but demographically matched cohorts, supporting broader application of normative modeling in ageing and neurodegeneration research.
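
To make the sub-sampling design concrete, the sketch below illustrates the general idea with a deliberately simplified model. It is not the authors' pipeline (the study uses a dedicated normative-modelling framework with transfer learning), and the data-frame columns (age, sex, roi_volume) are hypothetical placeholders.

```python
# Minimal sketch, not the authors' pipeline: draw an age-skewed healthy-control
# subsample, fit a simple Gaussian linear normative model, and compute deviation
# (z) scores for a fixed test set. Column names are hypothetical; sex is assumed
# to be numerically coded (e.g., 0/1).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def skewed_subsample(hc_df, n, skew="none", seed=0):
    """Draw n controls, optionally over-weighting younger (left) or older (right) ages."""
    rng = np.random.default_rng(seed)
    age = hc_df["age"].to_numpy(dtype=float)
    if skew == "left":       # over-represent the younger end of the age range
        w = age.max() - age + 1.0
    elif skew == "right":    # over-represent the older end of the age range
        w = age - age.min() + 1.0
    else:                    # uniform draw across available controls
        w = np.ones_like(age)
    idx = rng.choice(hc_df.index.to_numpy(), size=n, replace=False, p=w / w.sum())
    return hc_df.loc[idx]

def deviation_scores(train, test, roi="roi_volume"):
    """Fit roi ~ age + sex on the subsample; return z-scores for the test set."""
    model = LinearRegression().fit(train[["age", "sex"]], train[roi])
    resid = train[roi].to_numpy() - model.predict(train[["age", "sex"]])
    resid_sd = resid.std(ddof=3)  # residual SD; ddof accounts for intercept + 2 covariates
    return (test[roi].to_numpy() - model.predict(test[["age", "sex"]])) / resid_sd

# Example: z = deviation_scores(skewed_subsample(hc_df, n=200, skew="left"), test_df)
```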

Article activity feed

  1. eLife Assessment

    This important manuscript evaluates how sample size and demographic balance of reference cohorts affect the reliability of normative models. The evidence supporting the conclusions is convincing, although some additional analysis and clarifications could improve the generalisability of the conclusions. This work will be of interest to clinicians and scientists working with normative models.

  2. Reviewer #1 (Public review):

    Summary:

    Overall, this is a well-designed and carefully executed study that delivers clear and actionable guidance on the sample size and representative demographic requirements for robust normative modelling in neuroimaging. The central claims are convincingly supported.

    Strengths:

    The study has multiple strengths. First, it offers a comprehensive and methodologically rigorous analysis of sample size and age distribution, supported by multiple complementary fit indices. Second, the learning-curve results are compelling and reproducible and will be of immediate utility to researchers planning normative modelling projects. Third, the study includes both replication in an independent dataset and an adaptive transfer analysis from UK Biobank, highlighting both the robustness of the results and the practical advantages of transfer learning for smaller clinical cohorts. Finally, the clinical validation ties the methodological work back to clinical application.

    Weaknesses:

    There are two minor points for consideration:

    (1) Calibration of percentile estimates could be shown for the main evaluation (similar to that done in Figure 4E). Because the clinical utility of normative models often hinges on identifying individuals outside the 5th or 95th percentiles, readers would benefit from visual overlays of model-derived percentile curves on the curves from the full training data and simple reporting of the proportion of healthy controls falling outside these bounds for the main analyses (i.e., 2.1. Model fit evaluation).

    (2) The larger negative effect of left-skewed sampling likely reflects a mismatch between the younger training set and the older test set; accounting explicitly for this mismatch would make the conclusions more generalisable.
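
    For context, the calibration check suggested in point (1) can be summarised in a few lines; the sketch below assumes deviation scores are expressed as z-scores and simply reports the proportion of healthy controls falling outside the 5th/95th percentile bounds.

    ```python
    # Sketch of a percentile calibration check (assumes deviations are z-scores):
    # under a well-calibrated Gaussian model, roughly 5% of held-out healthy
    # controls should fall below the 5th and 5% above the 95th percentile.
    import numpy as np
    from scipy.stats import norm

    def tail_proportions(z_scores, lower=0.05, upper=0.95):
        z = np.asarray(z_scores, dtype=float)
        lo, hi = norm.ppf(lower), norm.ppf(upper)   # approx. -1.645 and +1.645
        return {"below_5th": float(np.mean(z < lo)),
                "above_95th": float(np.mean(z > hi)),
                "nominal_per_tail": lower}
    ```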

  3. Reviewer #2 (Public review):

    Summary:

    The authors test how sample size and demographic balance of reference cohorts affect the reliability of normative models in ageing and Alzheimer's disease. Using OASIS-3 and replicating in AIBL, they vary age and sex distributions as well as sample size, and show that age alignment is more important than overall sample size. They also demonstrate that models adapted from a large dataset (UK Biobank) can achieve stable performance with fewer samples. The results suggest that moderately sized but demographically well-balanced cohorts can provide robust performance.

    Strengths:

    The study is thorough and systematic, varying sample size, age, and sex distributions in a controlled way. Results are replicated in two independent datasets with relatively large sample sizes, thereby strengthening confidence in the findings. The analyses are clearly presented and use widely applied evaluation metrics. Clinical validation (outlier detection, classification) adds relevance beyond technical benchmarks. The comparison between within-cohort training and adaptation from a large dataset is valuable for real-world applications.

    The work convincingly shows that age alignment is crucial and that adapted models can reach good performance with fewer samples. However, some dataset-specific patterns (noted above) should be acknowledged more directly, and the practical guidance could be sharper.

    Weaknesses:

    The paper uses a simple regression framework, which is understandable for scalability but limits generalization to multi-site settings, where a hierarchical approach could better account for site differences. This limitation is acknowledged; a brief sensitivity analysis (or a clearer discussion) would help readers weigh the trade-offs. Beyond that, some points are not fully explained in the paper:

    (1) The replication in AIBL does not fully match the OASIS results. In AIBL, left-skewed age sampling converges with other strategies as sample size grows, unlike in OASIS. This suggests that skew effects depend on where variability lies across the age span.

    (2) Sex imbalance effects are difficult to interpret, since sex is included only as a fixed effect, and residual age differences may drive some errors.

    (3) In Figure 3, performance drops around n≈300 across conditions. This consistent pattern raises the question of sensitivity to individual samples or sub-sampling strategy.

    (4) The total outlier count (tOC) analysis is interesting but hard to generalize. For example, in AIBL, left-skew sometimes performs slightly better despite a weaker model fit. Clearer guidance on how to weigh model fit versus outlier detection would strengthen the practical message.

    (5) The suggested plateau at n≈200 seems context-dependent. It may be better to frame sample size targets in relation to coverage across age bins rather than as an absolute number.
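
    As background for point (4), the total outlier count is typically a per-participant tally of regions with extreme deviations. A minimal sketch follows; the threshold used here is an assumption, since studies differ in the cut-off and its direction.

    ```python
    # Sketch of a per-participant total outlier count (tOC). The cut-off used here
    # (z < -1.96) is an assumption; the sign and threshold vary between studies.
    import numpy as np

    def total_outlier_count(z_matrix, threshold=-1.96):
        """z_matrix: participants x regions array of deviation (z) scores."""
        z = np.asarray(z_matrix, dtype=float)
        return np.sum(z < threshold, axis=1)   # one count per participant

    # Group differences (e.g., AD vs. HC) can then be tested on these counts.
    ```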

  4. Author response

    We would like to thank the editors and two reviewers for the assessment and the constructive feedback on our manuscript, “Toward Robust Neuroanatomical Normative Models: Influence of Sample Size and Covariates Distributions”. We appreciate the thorough reviews and believe the suggestions will substantially strengthen the clarity and quality of our work. We plan to submit a revised version of the manuscript and a full point-by-point response addressing both the public reviews and the recommendations to the authors.

    Reviewer 1.

    In revision, we plan to address the reviewer’s comments by: (i) strengthening the interpretation of model fit by reporting the proportion of healthy controls within and outside the extreme percentile bounds; (ii) adding age-resolved overlays of model-derived percentile curves compared to those from the full reference cohort for key sample sizes and regions; (iii) quantifying age-distribution alignment between the training and test sets; and (iv) summarizing model performance as a joint function of age-distribution alignment and sample size.
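
    One way the planned train–test age-alignment quantification (point iii) could be implemented is sketched below; the Wasserstein distance and age-bin coverage shown here are illustrative choices, not necessarily the metrics that will appear in the revision.

    ```python
    # Illustrative train-test age-alignment metrics (not necessarily those used in
    # the revision): 1-D Wasserstein distance between age distributions, plus the
    # fraction of occupied test-set age bins that also contain training participants.
    import numpy as np
    from scipy.stats import wasserstein_distance

    def age_alignment(train_ages, test_ages, bin_width=5.0):
        train = np.asarray(train_ages, dtype=float)
        test = np.asarray(test_ages, dtype=float)
        dist = wasserstein_distance(train, test)        # 0 means identical distributions
        edges = np.arange(test.min(), test.max() + bin_width, bin_width)
        train_hist, _ = np.histogram(train, bins=edges)
        test_hist, _ = np.histogram(test, bins=edges)
        occupied = test_hist > 0
        coverage = float(np.mean(train_hist[occupied] > 0)) if occupied.any() else 0.0
        return {"wasserstein_distance": dist, "test_age_bin_coverage": coverage}
    ```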

    Reviewer 2.

    In the revised manuscript, we will (i) expand the Discussion to more clearly outline the trade-offs between simple regression frameworks and hierarchical models for normative modeling (e.g., scalability, handling of multi-site variation, computational considerations), and discuss alternative approaches and harmonization as important directions for multi-site settings; (ii) contextualize OASIS-3 vs AIBL differences by quantifying train–test age alignment across sampling strategies and emphasize that skewness should be interpreted relative to the target cohort’s alignment rather than in absolute terms; (iii) reassess sex-imbalance effects by reporting expected age distributions per condition and re-evaluating sex effects while controlling for age; (iv) investigate the apparent dip at n≈300 by increasing the number of sub-sampling seeds, testing neighboring sample sizes, and using an alternative age-binning scheme to clarify whether it reflects an artifact; (v) clarify potential divergence between tOC separation and global fit under discrepancies in demographic distributions and relate tOC to age-alignment distance; and (vi) reframe the sample-size guidance in terms of distributional alignment rather than an absolute n.