Sample Size Estimation for Machine Learning via Monte Carlo Simulation: Learning Curves and Power Laws

Abstract

Estimating the minimum sample size needed to reach a target performance in machine learning models is a recurring problem in applied research. Unlike many traditional statistical settings, where tools such as G*Power can provide closed-form answers, machine learning rarely offers analytic formulas for sample-size planning. In this article, we propose a Monte Carlo simulation approach that uses empirical learning curves and power-law fitting to estimate the required sample size for a given, fixed modeling pipeline. The method employs a Monte Carlo resampling design with a fixed test set and evaluates several training sample sizes through repeated stochastic runs; it then (1) builds learning curves, (2) fits a power-law model to extrapolate performance, and (3) identifies the smallest sample size that satisfies explicit criteria for mean performance, probability of success, and stability. We implement the procedure in R through the ml_sample_size() function in the easyML package, supporting both classification and regression with common algorithms (e.g., Random Forest, XGBoost, SVM, GLM) for tabular data typical in the social, health, and behavioral sciences. A key limitation is that the recommended sample size n∗ is conditional on the chosen hyperparameter configuration. When hyperparameter tuning is planned, n∗ should be treated as a lower bound, and additional observations may be needed to reflect the added complexity tuning typically introduces. Finally, we provide applied examples using simulated and real datasets to show how the method can support prospective study planning.
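The abstract does not specify the exact power-law parameterization, so the sketch below is illustrative only. It assumes a common three-parameter form, y(n) = a − b·n^(−c), fits it to hypothetical learning-curve points with base R's nls(), and inverts the fitted curve to find the smallest n whose predicted performance reaches a target. The data values, target, and starting values are made up; this is not the easyML implementation, and ml_sample_size() additionally handles the Monte Carlo resampling design and the probability-of-success and stability criteria described above.

```r
# Minimal sketch, not the easyML implementation: fit a power law
# y(n) = a - b * n^(-c) to hypothetical mean learning-curve points
# (e.g., mean test AUC across Monte Carlo runs at each training size).
lc <- data.frame(
  n   = c(100, 200, 400, 800, 1600),     # training set sizes (made up)
  auc = c(0.71, 0.76, 0.80, 0.83, 0.85)  # mean test AUC per size (made up)
)

# Nonlinear least-squares fit of the three-parameter power law
fit <- nls(auc ~ a - b * n^(-c), data = lc,
           start = list(a = 0.90, b = 2, c = 0.5))

# Invert the fitted curve: smallest n with predicted performance >= target.
# Solving a - b * n^(-c) = target gives n = (b / (a - target))^(1 / c),
# valid only when the fitted asymptote a exceeds the target.
target <- 0.86
p <- coef(fit)
n_star <- ceiling(unname((p["b"] / (p["a"] - target))^(1 / p["c"])))
n_star
```

In the full procedure, a candidate n∗ would also have to meet the probability-of-success and stability criteria across Monte Carlo runs, not only the mean-performance target used in this sketch.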
