High-Fidelity Synthetic Data Replicates Clinical Prediction Performance in a Million-Patient Diabetes Cohort

Víctor M. de la Oliva-Roque
David P. Kreil
Joaquín Dopazo
Francisco Ortuño
Carlos Loucera

Abstract

Synthetic data produced by generative models trained on real clinical data offers a promising solution to privacy concerns in health research. However, many efforts rely on small or demographically narrow training datasets, which limits the generalizability of the resulting synthetic data. To address this, we used real-world clinical data from nearly one million individuals with diabetes in the Andalusian Population Health Database (BPS) to generate a comprehensive longitudinal synthetic dataset.

We used a dual adversarial autoencoder to generate the synthetic data and evaluated its utility in a clinical machine learning (ML) task: predicting the onset of chronic kidney disease, a common complication of diabetes. Models trained on synthetic data were assessed for their ability to reproduce the patterns and predictive behaviors observed in real data. Performance and stability were compared across models trained on real, synthetic, and hybrid datasets. Models trained exclusively on synthetic data achieved AUROC scores comparable to those of models trained on real data (0.70 vs. 0.73) and showed high stability in feature importance rankings (weighted Kendall’s τ > 0.9). Notably, combining synthetic and real data did not improve performance.
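The two metrics named above can be illustrated with a minimal, standard-library Python sketch. This is not the authors' pipeline: the labels, scores, and feature importances are hypothetical toy values, and the additive hyperbolic weighting (ranks taken from the first importance vector) is an assumption, since the abstract does not say which weighting scheme was used.

```python
from itertools import combinations


def auroc(labels, scores):
    """AUROC as the probability that a random positive case outscores
    a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def weighted_kendall_tau(x, y):
    """Weighted Kendall's tau with additive hyperbolic pair weights.

    Assumption: ranks are taken from x (largest value = rank 0), so
    disagreements among the top-ranked features cost more than
    disagreements near the bottom.
    """
    order = sorted(range(len(x)), key=lambda i: -x[i])
    rank = {idx: r for r, idx in enumerate(order)}
    num = den_x = den_y = 0.0
    for i, j in combinations(range(len(x)), 2):
        w = 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)
        sx, sy = sign(x[i] - x[j]), sign(y[i] - y[j])
        num += w * sx * sy
        den_x += w * sx * sx
        den_y += w * sy * sy
    return num / (den_x * den_y) ** 0.5


# Hypothetical held-out real labels and scores from a model trained on synthetic data.
real_labels = [0, 0, 1, 1, 0, 1]
synthetic_model_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

# Hypothetical feature importances from a real-data model and a synthetic-data model.
real_importance = [0.40, 0.30, 0.20, 0.10]
synthetic_importance = [0.38, 0.31, 0.12, 0.19]

print("AUROC on real test data:", auroc(real_labels, synthetic_model_scores))
print("Weighted Kendall's tau:",
      weighted_kendall_tau(real_importance, synthetic_importance))
```

In the toy importances, the two models disagree only on the order of the two least important features, so the weighted statistic penalizes the swap less than an unweighted Kendall's τ would.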

Our findings demonstrate that high-fidelity synthetic longitudinal data can replicate real data performance in clinical ML, supporting its use in research while preserving patient privacy. This represents a significant step toward more collaborative and privacy-preserving healthcare data ecosystems.

Version published to 10.1101/2025.07.20.25331852 on medRxiv
Jul 21, 2025

Related articles

Generative AI-Based Imputation to Preserve Data Fidelity and Enhance Outcome Prediction: A Multi-Institutional Study in Cardiac Surgery

This article has 11 authors:
1. Negin Maddah
2. Amin Ramezani
3. Qingchu Jin
4. Jakob Wollborn
5. Akinobu Itoh
6. Jaime B. Rabb
7. Felistas Mazhude
8. Robert S. Kramer
9. Douglas B. Sawyer
10. Raimond L. Winslow
11. Farhad R. Nezami
This article has no evaluations
Latest version Jan 23, 2026

A novel pipeline for realistic synthetic longitudinal EHR data generation

This article has 3 authors:
1. Gabrielle Josling
2. Ibrahima Diouf
3. Sankalp Khanna
This article has no evaluations
Latest version Jan 29, 2026

Evaluating the Utility of Synthetic Image Generation for Medical AI: A Review

This article has 3 authors:
1. Israa Atike
2. Asifa Mehmood Qureshi
3. Abhishek Kaushik
This article has no evaluations
Latest version Dec 22, 2025
