High-Fidelity Synthetic Data Replicates Clinical Prediction Performance in a Million-Patient Diabetes Cohort
Abstract
Synthetic data generated using generative models trained on real clinical data offers a promising solution to privacy concerns in health research. However, many efforts are limited by small or demographically narrow training datasets, reducing the generalizability of the synthetic data. To address this, we used real-world clinical data from nearly one million individuals with diabetes in the Andalusian Population Health Database (BPS) to generate a comprehensive longitudinal synthetic dataset.
We employed a dual adversarial autoencoder to produce synthetic data and evaluated its utility in a clinical machine learning (ML) task: predicting the onset of chronic kidney disease, a common complication of diabetes. Models trained on synthetic data were assessed for their ability to reproduce the patterns and predictive behaviors observed in real data. Performance and stability were compared across models trained on real, synthetic, and hybrid datasets. Models trained exclusively on synthetic data achieved an area under the receiver operating characteristic curve (AUROC) comparable to that of models trained on real data (0.70 vs. 0.73) and showed high stability in feature importance rankings (weighted Kendall’s τ > 0.9). Notably, combining synthetic and real data did not improve performance.
Our findings demonstrate that high-fidelity synthetic longitudinal data can replicate real-data performance in clinical ML, supporting its use in research while preserving patient privacy. This represents a significant step toward more collaborative and privacy-preserving healthcare data ecosystems.
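
To make the two evaluation metrics named in the abstract concrete, the following is a minimal pure-Python sketch of AUROC and a weighted Kendall's τ between two feature-importance vectors. It is not the authors' code. The function names and the toy values are illustrative assumptions, and the weighting scheme (additive hyperbolic weights, with ranks taken from the first, reference vector) is also an assumption, since the abstract does not state which variant was used.

```python
import math
from itertools import combinations


def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _sign(v):
    return (v > 0) - (v < 0)


def weighted_kendall_tau(reference, other):
    """Weighted Kendall's tau between two importance vectors.

    Assumption: each pair (i, j) is weighted by 1/(r_i + 1) + 1/(r_j + 1),
    where r is the rank of the feature in `reference` (0 = most important),
    so disagreements among top features are penalised more heavily.
    """
    n = len(reference)
    if n != len(other):
        raise ValueError("vectors must have the same length")
    order = sorted(range(n), key=lambda i: -reference[i])
    rank = {feature: r for r, feature in enumerate(order)}

    num = norm_ref = norm_other = 0.0
    for i, j in combinations(range(n), 2):
        w = 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)
        s_ref = _sign(reference[i] - reference[j])
        s_other = _sign(other[i] - other[j])
        num += s_ref * s_other * w
        norm_ref += s_ref * s_ref * w
        norm_other += s_other * s_other * w
    if norm_ref == 0 or norm_other == 0:
        raise ValueError("tau is undefined when a vector is constant")
    return num / math.sqrt(norm_ref * norm_other)


# Toy values, invented purely for illustration.
real_importance = [4.0, 3.0, 2.0, 1.0]       # e.g. from a real-data model
swap_bottom = [4.0, 3.0, 1.0, 2.0]           # two least important swapped
swap_top = [3.0, 4.0, 2.0, 1.0]              # two most important swapped

tau_bottom = weighted_kendall_tau(real_importance, swap_bottom)
tau_top = weighted_kendall_tau(real_importance, swap_top)
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With these toy vectors, swapping the two most important features lowers τ more than swapping the two least important ones, which is the property that makes a weighted τ a natural choice for comparing importance rankings. In practice a library routine such as `scipy.stats.weightedtau` would likely be preferable; note that its default behaviour differs slightly from this sketch.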