Assessing synthetic data generation utility for cohort data secondary use

Mattia Mazzoli
Janis Elfert
Paolo Sacerdoti
Marco Hirsch
Michael Davis Tira
Daniela Paolotti

Abstract

Access to cohort data is critical for reproducing studies, validating hypotheses in new settings, and testing new approaches on past data in light of new findings. Nevertheless, previously collected data are often inaccessible due to confidentiality and privacy standards. In recent years, thanks to advances in AI, synthetic data generation has emerged as a possible way to facilitate cohort data sharing. However, it faces a tradeoff between statistical fidelity and information disclosure, which becomes even more restrictive for high-dimensional data.

Here we assess the feasibility of employing state-of-the-art data synthesization and anonymization techniques to generate high-fidelity, privacy-preserving synthetic datasets capable of meaningfully reproducing study results, thus benchmarking the potential of synthetic cohort data for public sharing.

We design a protocol relying on four public packages, seven privacy metrics and five synthesization algorithms. We employ data collected within the framework of the Verdi project (Influweb) and publicly available hospitalisation cohort datasets (MIMIC-III) to assess the privacy levels preserved by multiple data synthesization algorithms in three different studies. We employ multiple privacy metrics to ensure that statistical fidelity in our datasets does not come at the cost of information disclosure. Finally, we qualitatively compare the similarity of results reproduced with the synthetic datasets, noting that determinants of health-related behavior and in-hospital mortality remain largely unchanged with respect to estimates performed on the original data.
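
The abstract does not name the seven privacy metrics or the four packages. As a rough illustration of the kind of check such a protocol involves, the sketch below computes distance to closest record (DCR), a commonly used disclosure indicator, alongside a simple fidelity gap. Both the choice of DCR and the function names `distance_to_closest_record` and `fidelity_gap` are assumptions made for this example, not the authors' method, and the data are random numbers rather than Influweb or MIMIC-III records.

```python
import numpy as np


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, Euclidean distance to its nearest real row.

    Hypothetical helper: a distance of 0 means the synthetic row
    reproduces a real record exactly, which suggests disclosure risk.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    # Broadcast to shape (n_synthetic, n_real, n_features).
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)


def fidelity_gap(synthetic, real):
    """Largest absolute difference between per-feature means (hypothetical)."""
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    return np.abs(synthetic.mean(axis=0) - real.mean(axis=0)).max()


# Toy data standing in for a numeric cohort table (assumption).
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))

# A "leaky" generator that copies real rows, and a noisier alternative.
leaky = real[:50].copy()
noisy = real[:50] + rng.normal(scale=0.5, size=(50, 3))

dcr_leaky = distance_to_closest_record(leaky, real)
dcr_noisy = distance_to_closest_record(noisy, real)

print("median DCR, copied rows:", np.median(dcr_leaky))
print("median DCR, noisy rows: ", np.median(dcr_noisy))
print("fidelity gap, noisy rows:", fidelity_gap(noisy, real))
```

Copied rows sit at distance zero from the real data, while the perturbed rows move away from it at some cost in fidelity, which is the tradeoff the protocol is meant to monitor. Real cohort tables mix categorical and numeric columns, so a practical version would need an appropriate encoding or a mixed-type distance.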

Our findings show that synthetic data generation is a promising technique for public data sharing and study reproducibility. However, its broad application for new exploratory studies on complex or rare patterns may introduce limitations and biases, as fidelity of previously unobserved statistical relationships is not guaranteed.

Version published to 10.1101/2025.06.09.25329247 on medRxiv
Jun 9, 2025

A novel pipeline for realistic synthetic longitudinal EHR data generation

This article has 3 authors:
1. Gabrielle Josling
2. Ibrahima Diouf
3. Sankalp Khanna
This article has no evaluations
Latest version Jan 29, 2026

A Multidimensional Evaluation of Privacy-Preserving Generative Models for Neonatal Clinical Tabular Data: Fidelity, Utility, and Realism Trade-offs

This article has 5 authors:
1. Tb Ai Munandar
2. Tyastuti Sri Lestari
3. Achmad Noe’man
4. Alimuddin Alimuddin
5. Ria Arafiyah
This article has no evaluations
Latest version Jan 21, 2026

Generative AI-Based Imputation to Preserve Data Fidelity and Enhance Outcome Prediction: A Multi-Institutional Study in Cardiac Surgery

This article has 11 authors:
1. Negin Maddah
2. Amin Ramezani
3. Qingchu Jin
4. Jakob Wollborn
5. Akinobu Itoh
6. Jaime B. Rabb
7. Felistas Mazhude
8. Robert S. Kramer
9. Douglas B. Sawyer
10. Raimond L. Winslow
11. Farhad R. Nezami
This article has no evaluations
Latest version Jan 23, 2026
