Sample Size Estimation for Machine Learning via Monte Carlo Simulation: Learning Curves and Power Laws

Abstract

Estimating the minimum sample size needed to reach a target performance in machine learning models is a recurring problem in applied research. Unlike many traditional statistical settings, where tools such as G*Power can provide closed-form answers, machine learning rarely offers analytic formulas for sample-size planning. In this article, we propose a Monte Carlo simulation approach that uses empirical learning curves and power-law fitting to estimate the required sample size for a given, fixed modeling pipeline. The method employs a Monte Carlo resampling design with a fixed test set and evaluates several training sample sizes through repeated stochastic runs; it then (1) builds learning curves, (2) fits a power-law model to extrapolate performance, and (3) identifies the smallest sample size that satisfies explicit criteria for mean performance, probability of success, and stability. We implement the procedure in R through the ml_sample_size() function in the easyML package, supporting both classification and regression with common algorithms (e.g., Random Forest, XGBoost, SVM, GLM) for tabular data typical in the social, health, and behavioral sciences. A key limitation is that the recommended sample size n∗ is conditional on the chosen hyperparameter configuration. When hyperparameter tuning is planned, n∗ should be treated as a lower bound, and additional observations may be needed to reflect the added complexity tuning typically introduces. Finally, we provide applied examples using simulated and real datasets to show how the method can support prospective study planning.
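The abstract does not specify the exact power-law parameterization, so the sketch below is illustrative only. It assumes a common three-parameter form, y(n) = a − b·n^(−c), fits it to hypothetical learning-curve points with base R's nls(), and inverts the fitted curve to find the smallest n whose predicted performance reaches a target. The data values, target, and starting values are made up; this is not the easyML implementation, and ml_sample_size() additionally handles the Monte Carlo resampling design and the probability-of-success and stability criteria described above.

```r
# Minimal sketch, not the easyML implementation: fit a power law
# y(n) = a - b * n^(-c) to hypothetical mean learning-curve points
# (e.g., mean test AUC across Monte Carlo runs at each training size).
lc <- data.frame(
  n   = c(100, 200, 400, 800, 1600),     # training set sizes (made up)
  auc = c(0.71, 0.76, 0.80, 0.83, 0.85)  # mean test AUC per size (made up)
)

# Nonlinear least-squares fit of the three-parameter power law
fit <- nls(auc ~ a - b * n^(-c), data = lc,
           start = list(a = 0.90, b = 2, c = 0.5))

# Invert the fitted curve: smallest n with predicted performance >= target.
# Solving a - b * n^(-c) = target gives n = (b / (a - target))^(1 / c),
# valid only when the fitted asymptote a exceeds the target.
target <- 0.86
p <- coef(fit)
n_star <- ceiling(unname((p["b"] / (p["a"] - target))^(1 / p["c"])))
n_star
```

In the full procedure, a candidate n∗ would also have to meet the probability-of-success and stability criteria across Monte Carlo runs, not only the mean-performance target used in this sketch.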
