Synthetic data as a method for increasing reproducibility and transparency in educational research

Abstract

Open data are often regarded as an important step towards improving the reproducibility and transparency of educational science. Yet data sharing remains rare, and without open data, statistical analyses often remain irreproducible. In this article, we provide an introduction to synthetic data, a statistical technique based on multiple imputation (MI) that can be used to create simulated copies of the data that can be shared even when the original data cannot. To this end, we discuss reproducibility-related challenges of synthetic data and outline different approaches for generating synthetic data, including conventional and data-augmented MI (DA-MI) approaches. Furthermore, we conducted a case study using data from the PISA 2018 study, in which we aimed to address several challenges with synthetic data in educational research, such as missing data, multilevel data, and complex sampling designs. Our results indicate that these challenges can be addressed with relatively simple tools and that synthetic data can reproduce the results of a variety of statistical analyses. Finally, we discuss remaining challenges and directions for future research.
