Assessing synthetic data generation utility for cohort data secondary use
Abstract
Access to cohort data is critical for reproducing studies, validating hypotheses in new settings and testing new approaches on past data in light of new findings. Nevertheless, previously collected data are often inaccessible due to confidentiality and privacy standards. In recent years, with advances in AI, synthetic data generation has emerged as a possible solution to facilitate cohort data sharing. However, synthetic data generation faces a tradeoff between statistical fidelity and information disclosure, which becomes even more restrictive for high-dimensional data.
Here we assess the feasibility of employing state-of-the-art data synthesis and anonymization techniques to generate high-fidelity, privacy-preserving synthetic datasets capable of meaningfully reproducing study results, thus benchmarking the potential of synthetic cohort data for public sharing.
We design a protocol relying on four public packages, seven privacy metrics and five synthesis algorithms. We employ data collected within the framework of the Verdi project (Influweb) and publicly available hospitalisation cohort datasets (MIMIC-III) to assess the privacy levels preserved by multiple data synthesis algorithms in three different studies. We employ multiple privacy metrics to ensure that statistical fidelity in our datasets does not come at the cost of information disclosure. Finally, we qualitatively compare the results reproduced with the synthetic datasets against the originals, noting that determinants of health-related behavior and in-hospital mortality remain largely unchanged with respect to estimates performed on the original data.
Our findings show that synthetic data generation is a promising technique for public data sharing and study reproducibility. However, its broad application for new exploratory studies on complex or rare patterns may introduce limitations and biases, as fidelity of previously unobserved statistical relationships is not guaranteed.
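As a rough illustration of the disclosure side of the fidelity/privacy tradeoff described above, the sketch below computes a distance-to-closest-record check between a synthetic table and the real table. The abstract does not name its seven privacy metrics, so this choice of metric, the function names and the toy Gaussian data are all assumptions made for illustration, not the authors' protocol.

```python
import numpy as np

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, Euclidean distance to its nearest real row."""
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    # Pairwise differences via broadcasting: (n_syn, n_real, n_features)
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)

def exact_copy_rate(synthetic, real, tol=1e-12):
    """Fraction of synthetic rows that coincide with some real row."""
    return float((distance_to_closest_record(synthetic, real) <= tol).mean())

# Toy data (hypothetical): 200 "real" records with 5 numeric features.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 5))

# A degenerate "synthesizer" that memorizes real records: perfect fidelity
# on those rows, but every output discloses a real record.
leaky = real[:50].copy()

# Independent draws from the same distribution: no record is copied.
fresh = rng.normal(size=(50, 5))

print(exact_copy_rate(leaky, real))  # 1.0
print(exact_copy_rate(fresh, real))  # 0.0
```

A real evaluation would typically also compare these distances against a held-out set of real records and handle categorical variables, which this sketch omits.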