Synthetic data as a method for increasing reproducibility and transparency in educational research
Abstract
Open data are often regarded as an important step towards improving the reproducibility and transparency of educational science. Yet, data sharing remains rare, and without open data, statistical analyses often remain irreproducible. In this article, we provide an introduction to synthetic data, a statistical technique based on multiple imputation (MI) that can be used to create simulated copies of the data that can be shared even when the original data cannot. To this end, we discuss reproducibility-related challenges of synthetic data and outline different approaches for generating synthetic data, including conventional and data-augmented MI (DA-MI) approaches. Furthermore, we conducted a case study using data from the PISA 2018 study, in which we aimed to address several challenges with synthetic data in educational research, such as missing data, multilevel data, and complex sampling designs. Our results indicate that these challenges can be addressed with relatively simple tools and that synthetic data can reproduce the results of a variety of statistical analyses. Finally, we discuss remaining challenges and directions for future research.