Widespread use of invalid statistical tests in biomedical machine learning

Tianchu Zeng
Hetu Li
Shaoshi Zhang
Yan Quan Tan
Fang Tian
Csaba Orban
Lijun An
Wanyu Che
Jingwen Cheng
Joanna Su Xian Chong
Niousha Dehestani
Zijian Dong
Xin Li
Zhizhou Li
Mervyn Jun Rui Lim
Yi Lin
Qinrui Ling
Zijie Ling
Xi Zhi Low
Sina Mansour L.
Eric Kwun Kei Ng
Thuan Tinh Nguyen
Leon Qi Rong Ooi
Shreya Pande
Xing Qian
Jingxuan Ruan
Ziwen Wang
Yapei Xie
Chen Zhang
Yichi Zhang
Kaustubh Patil
Linden Parkes
Elvisha Dhamala
Sidhant Chopra
Andrew Zalesky
Avram Holmes
Simon Eickhoff
Juan Helen Zhou
Olivier Renaud
Nico Dosenbach
Konrad Kording
Danilo Bzdok
Thomas E. Nichols
B.T. Thomas Yeo

Abstract

Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance – not only to benchmark algorithms, but also to inform scientific applications, such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 – 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
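To make the independence issue concrete, the sketch below contrasts a naive paired t statistic on fold-level differences with the corrected resampled t statistic of Nadeau and Bengio, one example of an existing fold-dependence-aware test. It is not the SHARP test proposed in the abstract, and the abstract notes that such corrections rest on strong assumptions. The per-fold differences, the sample sizes, and the function names are hypothetical values chosen only for illustration.

```python
import math
from statistics import mean, variance


def naive_paired_t(diffs):
    """Paired t statistic that treats the fold-level differences as independent."""
    k = len(diffs)
    return mean(diffs) / math.sqrt(variance(diffs) / k)


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic (Nadeau and Bengio).

    The variance term 1/k is replaced by 1/k + n_test/n_train to account
    for the overlap between training sets across folds.
    """
    k = len(diffs)
    scale = 1.0 / k + n_test / n_train
    return mean(diffs) / math.sqrt(scale * variance(diffs))


# Hypothetical per-fold accuracy differences (model A minus model B), 10-fold CV
diffs = [0.021, 0.013, 0.030, 0.008, 0.017, 0.025, 0.011, 0.019, 0.027, 0.015]

# Assumed sizes: 1000 samples, so each fold trains on 900 and tests on 100
t_naive = naive_paired_t(diffs)
t_corrected = corrected_resampled_t(diffs, n_train=900, n_test=100)

print(f"naive t     = {t_naive:.3f}")
print(f"corrected t = {t_corrected:.3f}")
```

Both statistics share the same numerator, so the corrected value is always smaller in magnitude. With 10 folds and a 900/100 split, the correction roughly doubles the estimated variance of the mean difference, which is the kind of gap that can turn a nominally significant comparison into a non-significant one.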

Version published to 10.64898/2026.05.17.724301 on bioRxiv
May 20, 2026

Related articles

Quantifying the Optimism of Naive Cross-Validation for Binary Outcome Prediction with Repeated-Measures Predictors: A Simulation Study and Clinical Illustration

This article has 1 author:
1. Joseph L. Hagan
This article has no evaluations. Latest version: May 29, 2026

Benchmarking of Ensembles and Meta‐Ensembles in the Multiclass Classification of Obesity Risk: Predictive Performance, Calibration and Interpretability

This article has 5 authors:
1. Daniel Andrade-Girón
2. William Marin-Rodriguez
3. Américo Peña
4. Elsa Oscuvilca-Tapia
5. Fredy Bermejo-Sanchez
This article has no evaluations. Latest version: Apr 10, 2026

Interpretable Predictive Modeling for Medical Data Using Boolean Rule-aware Regression

This article has 2 authors:
1. Mohammad Eskandarian
2. Seyed Amir Malekpour
This article has no evaluations. Latest version: May 18, 2026
