External validation of machine learning models—registered models and adaptive sample splitting
Abstract
Background
Multivariate predictive models play a crucial role in enhancing our understanding of complex biological systems and in developing innovative, replicable tools for translational medical research. However, the complexity of machine learning methods and extensive data preprocessing and feature engineering pipelines can lead to overfitting and poor generalizability. An unbiased evaluation of predictive models necessitates external validation, which involves testing the finalized model on independent data. Despite its importance, external validation is often neglected in practice due to the associated costs.
Results
Here we propose that, for maximal credibility, model discovery and external validation should be separated by the public disclosure (e.g., preregistration) of feature processing steps and model weights. Furthermore, we introduce a novel approach to optimize the trade-off between efforts spent on model discovery and external validation in such studies. We show on data involving more than 3,000 participants from four different datasets that, for any “sample size budget,” the proposed adaptive splitting approach can successfully identify the optimal time to stop model discovery so that predictive performance is maximized without risking a low-powered, and thus inconclusive, external validation.
Conclusion
The proposed design and splitting approach (implemented in the Python package “AdaptiveSplit”) may contribute to addressing issues of replicability, effect size inflation, and generalizability in predictive modeling studies.
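As a rough illustration of the idea (a minimal sketch with hypothetical names and a deliberately simplified stopping criterion, not the algorithm implemented in the AdaptiveSplit package): model discovery proceeds on a growing share of the collected sample while a power estimate for the yet-unused remainder tracks whether the external validation would still be conclusive; discovery stops once the internal learning curve flattens or one more acquisition step would leave the validation set under-powered.

```python
# Conceptual sketch only: hypothetical names, simplified plateau + power criterion;
# NOT the algorithm implemented in the AdaptiveSplit package.
import numpy as np
from scipy import stats
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def validation_power(r, n_val, alpha=0.05):
    """Approximate power to detect a correlation of size r with n_val samples
    (one-sided test, Fisher z approximation)."""
    if n_val < 4 or r <= 0:
        return 0.0
    z = np.arctanh(r) * np.sqrt(n_val - 3)
    return 1.0 - stats.norm.cdf(stats.norm.ppf(1 - alpha) - z)


def adaptive_split(X, y, step=10, min_power=0.8, tol=0.01):
    """Grow the discovery set while the internal learning curve still improves
    and the held-back samples would still give a well-powered validation."""
    n_total = len(y)
    scores = []
    for n_act in range(2 * step, n_total - step, step):
        score = cross_val_score(Ridge(), X[:n_act], y[:n_act], cv=5).mean()
        scores.append(score)
        r_hat = np.sqrt(max(score, 0.0))       # crude effect-size proxy from R^2
        n_val_next = n_total - (n_act + step)  # validation size after one more step
        plateaued = len(scores) >= 4 and scores[-1] - scores[-4] < tol
        if plateaued or validation_power(r_hat, n_val_next) < min_power:
            return n_act                       # stop discovery; the rest is validation
    return n_total - step
```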
Article activity feed
A version of this preprint has been published in the Open Access journal *GigaScience* (https://doi.org/10.1093/gigascience/giaf036), where the paper and the peer reviews are published openly under a CC BY 4.0 license.
Revised version 1
Reviewer 1: Qingyu Zhao
Thanks to the authors for the thorough response. The only remaining comment is that some new supplementary figures (figures 8-12) are not cited or explained in the main text (maybe I missed it?). Please make sure to discuss these supplementary figures in the main text; otherwise readers wouldn't know they are there. The response reads "To provide even more insights, we now present the relationship between the internally validated scores at the time of stopping (I_{act}), the corresponding external validation scores and sample sizes, for all 4 datasets in supplementary figures 8-11. The figures show a relatively good correspondence between internally and externally validated performance estimates with all splitting strategies". What insights are given? What do you mean by a relatively good correspondence between internal and external performance? All I see in those figures is a set of normally distributed scatter plots, so they need better explanation.
Reviewer 2: Lisa Crossman
I previously reviewed this manuscript, and all the comments I made were answered in full. I would be pleased to recommend publication. I was fully able to replicate the adaptive split results from the GitHub repo. My only remaining comment is that I received several runtime warnings ("RuntimeWarning: divide by zero encountered in scalar divide"), which can also be seen in the Jupyter notebook example.
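For context (illustrative only, and not tied to any particular line of the adaptivesplit code): NumPy emits this warning when a NumPy scalar is divided by zero; either guarding the denominator or wrapping the division in a local `np.errstate` context avoids it.

```python
import numpy as np

# Typical source of "RuntimeWarning: divide by zero encountered in scalar divide":
x = np.float64(1.0)
y = np.float64(0.0)

# Option 1: guard the denominator and return a defined fallback value.
ratio = np.nan if y == 0 else x / y

# Option 2: silence the warning locally, keeping IEEE inf/nan semantics.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = x / y  # inf, but no RuntimeWarning is emitted
```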
Original version
Reviewer 1: Qingyu Zhao
The manuscript discusses an interesting approach that seeks an optimal data split for the pre-registration framework. The approach adaptively optimizes the balance between the predictive performance of the discovery set and the sample size of the external validation set. The approach is showcased on four applications, demonstrating an advantage over traditional fixed data splits (e.g., 80/20). I generally enjoyed reading the manuscript. I believe pre-registration is an important tool for reproducible ML analysis, and the rationale behind the proposed framework (investigating the balance between discovery power and validation power) is urgently needed. My main concerns are all around Fig. 3, which represents the core quantitative analysis but lacks many details.
- Fig. 3 is mostly about external validation. What about training? For each n_total, which stopping rule is activated? What is the training accuracy? What does l_act look like? What is \hat{s_total}?
- Results section states "the proposed adaptive splitting strategy always provided equally good or better predictive performance than the fixed splitting strategies (as shown by the 95% confidence intervals on Figure 3)". I'm confused by this because the blue curve is often below other methods in accuracy (e.g., comparing with 90/10 split in ABIDE and HCP).
- Why does the half split have the lowest accuracy but the highest statistical power?
- How was the range of x-axis (n_total) selected? E.g., HCP has 1000 subjects, why was 240-380 chosen for analysis?
- The lowest n_total for BCW and IXI is approximately 50. If n_act starts from 10% of n_total, how is it possible to train (nested) cross-validation on 5 samples or so?
Two other general comments are:
- How can this be applied to retrospective data or secondary data analysis where the collection is finished?
- Is there a guidance on the minimum sample size that is required to perform such an auto-split analysis? It is surprising that the authors think the two studies with n=35 and n=38 are good examples of training generalizable ML models. It is generally hard to believe any ML analysis can be done on such low sample sizes with thousands of rs-fMRI features. By the way, I believe n=25 in Kincses 2024 if I read it correctly.
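For orientation on the minimum sample size question, a back-of-the-envelope calculation (Fisher z approximation, used here purely for illustration; not necessarily the rule used in the paper) gives the smallest external-validation sample needed to detect an expected correlation with a given power:

```python
import numpy as np
from scipy import stats

def min_validation_n(r_expected, power=0.8, alpha=0.05):
    """Smallest n giving the desired power to detect a correlation r_expected
    (one-sided test, Fisher z approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    n = ((z_alpha + z_beta) / np.arctanh(r_expected)) ** 2 + 3
    return int(np.ceil(n))

# Roughly 68 samples for r = 0.3 and roughly 24 for r = 0.5 at 80% power.
print(min_validation_n(0.3), min_validation_n(0.5))
```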
Reviewer 2: Lisa Crossman
External validation of machine learning models - registered models and adaptive sample splitting, Gallitto et al. The manuscript describes a methodology and algorithm aimed at better choosing the train-test validation split of data for scikit-learn models. A Python package, adaptivesplit, was built as part of this manuscript as a tool for others to use. The package is proposed to be used together with a suggested workflow that invokes registered models as a full design for better prospective modelling studies. Finally, the work is evaluated on four publicly available health research datasets, and comprehensive results are presented. There is a trade-off in the split between the amount of sample data used for training and the amount used for validation: ideally, both must be balanced so that the trained model and the validation set are each representative. This manuscript is therefore very timely, given the large increase in the use of AI models, and provides important information and methodology.
This reviewer does not have the specific expertise to provide detailed comments on the statistical rule methods.
Main Suggested Revision:
- The Python implementation of the "adaptivesplit" package is described as available on GitHub (Gallitto et al., n.d.). One of the major points of the paper is to provide the Python package "adaptivesplit"; however, this package does not have a clear hyperlink, is not found by simple Google searches, and appears not to be available yet. It is therefore not possible to evaluate it at present. Further Google searches turned up a website with a preprint of this manuscript, https://pnilab.github.io/adaptivesplit/; however, adaptivesplit is shown there as an interactive Jupyter-style notebook example and not as Python library code. Therefore, it is not clear how available the package is for others' use. Can the authors comment on the code availability?
Minor comments:
- Apart from the 80:20 Pareto split of train-test data, other split ratios are commonly used, such as 75:25 (the scikit-learn default when the ratio is unspecified) and 70:30, as well as the cross-validation strategy with a 60:20:20 train-test-validation split; yet these strategies are not mentioned or included in figures such as Fig. 3. The splits provided in the figure and discussed are 50:50, 80:20 and 90:10 only. Could the authors discuss alternative split ratios?
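For reference, the fixed ratios mentioned above map directly onto scikit-learn's train_test_split (leaving test_size unset falls back to the documented default of 0.25); the sketch below merely prints the resulting set sizes on the Breast Cancer Wisconsin data shipped with scikit-learn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # the BCW data, n = 569

for test_size in (None, 0.5, 0.3, 0.25, 0.2, 0.1):  # None -> default 0.25
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y
    )
    label = "default" if test_size is None else f"{test_size:.2f}"
    print(f"test_size={label}: n_train={len(X_tr)}, n_test={len(X_te)}")
```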