A Quantitative Comparison of Structural and Distributional Properties of Synthetic Tabular Data in Parkinson’s Disease

Shahryar Wasif
Farhan Raza
Dhruvil Patel
Taylor Chomiak
Bin Hu


Abstract

Background

Parkinson’s disease (PD) research relies heavily on patient data, but access is often limited by privacy concerns, data scarcity, and collection costs. Synthetic data generation offers a potential solution, but its utility hinges on rigorously evaluated fidelity to real-world data. This study quantitatively assesses the structural and distributional fidelity of synthetic tabular data designed to represent PD patients.

Methods

We compared a synthetically generated dataset (N=500 hypothetical entries) against an anonymized real-world dataset (N=57 PD patients) containing demographics, clinical scores (Unified Parkinson’s Disease Rating Scale, UPDRS; Montreal Cognitive Assessment, MoCA), and mobility data (variables related to the 6-minute walk test, 6MWT). The evaluation used three quantitative metrics: (1) Column Correlation Stability, measured as the average absolute difference between the Pearson correlation matrices of the two datasets, assessed overall and for clinically relevant variable subgroups (6MWT, UPDRS, MoCA); (2) Principal Component Analysis (PCA), evaluating the variance captured by the top principal components in both datasets; and (3) Jensen-Shannon Distance (JSD), quantifying the distributional similarity between real and synthetic variables across variable groups.
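As a rough illustration of metrics (1) and (3), the sketch below computes a mean absolute correlation difference and a histogram-based Jensen-Shannon distance with NumPy. It is not the authors' code: the function names, the choice to average only the off-diagonal correlations, the 10-bin histograms, the base-2 logarithm, and the random demo data are all assumptions made for illustration.

```python
import numpy as np

def mean_abs_corr_diff(real, synth):
    """Mean absolute difference between Pearson correlation matrices.

    Assumption: only the off-diagonal (upper-triangle) entries are averaged,
    since the diagonal is 1 in both matrices by construction.
    """
    r = np.corrcoef(np.asarray(real, dtype=float), rowvar=False)
    s = np.corrcoef(np.asarray(synth, dtype=float), rowvar=False)
    upper = np.triu_indices_from(r, k=1)
    return float(np.mean(np.abs(r[upper] - s[upper])))

def js_distance(x, y, bins=10):
    """Jensen-Shannon distance (base 2, range 0..1) between two samples.

    Assumption: each variable is discretised into `bins` equal-width bins
    over the combined range of both samples.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence, skipping empty bins of `a`.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))

# Hypothetical demo data with the same row counts as the study (57 vs 500).
rng = np.random.default_rng(0)
real_demo = rng.normal(size=(57, 4))
synth_demo = rng.normal(size=(500, 4))
corr_gap = mean_abs_corr_diff(real_demo, synth_demo)
jsd_first_column = js_distance(real_demo[:, 0], synth_demo[:, 0])
```

With this convention a distance of 0 means identical binned distributions and 1 means no overlap, so lower values correspond to higher fidelity. Subgroup scores would follow by passing only the columns of one group (for example the UPDRS items) to either function.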

Results

The overall average absolute correlation difference between the real and synthetic datasets was 0.049, indicating moderate preservation of pairwise variable relationships globally. However, stability varied across subgroups: the 6MWT group showed higher fidelity (difference ≈0.044) than the UPDRS (≈0.080) and MoCA (0.081) groups. PCA revealed that the first two principal components captured 21.36% and 16.36% of the variance, respectively, with visual analysis showing partial overlap between real and synthetic data clusters. Average JSD values, for which lower values mean closer distributions, indicated moderate distributional similarity overall: the MoCA group exhibited the highest fidelity (JSD = 0.0573), while the Demographics (0.1167), Clinical (0.1256), and 6MWT (0.1175) groups showed lower similarity.
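Variance shares such as the 21.36% and 16.36% reported above are explained-variance ratios of the leading principal components. A minimal sketch of that calculation follows, assuming the columns are standardised before PCA; the abstract does not say whether the components were fitted on the real data, the synthetic data, or both combined, so the function name and the toy inputs here are hypothetical.

```python
import numpy as np

def explained_variance_ratio(X, n_components=2):
    """Share of total variance captured by the leading principal components.

    Assumption: columns are centred and scaled to unit variance first, so
    variables measured on different scales contribute equally.
    """
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    std = centred.std(axis=0, ddof=1)
    scaled = centred / np.where(std > 0, std, 1.0)  # leave constant columns at 0
    # Squared singular values are proportional to per-component variances.
    singular_values = np.linalg.svd(scaled, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()

# Hypothetical demo: 57 rows and 6 loosely related columns.
rng = np.random.default_rng(1)
demo = rng.normal(size=(57, 6))
demo[:, 1] += 0.8 * demo[:, 0]  # introduce some shared variance
ratios = explained_variance_ratio(demo, n_components=2)
```

Running the same function on the real and the synthetic table and comparing the two ratio vectors is one simple way to check whether the synthetic data concentrates variance in the same number of dimensions as the real data.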

Conclusion

Synthetic data generation techniques can replicate univariate distributional properties of PD patient data with moderate success, particularly for certain variable types like cognitive assessments (MoCA). However, accurately capturing the complex multivariate correlation structures, crucial for understanding symptom interactions and building predictive models, remains a significant challenge, especially within specific clinical domains like UPDRS. While synthetic data holds promise for addressing data access issues in PD research, particularly for tasks less sensitive to correlation structure, its application requires careful, context-specific validation. Further development is needed to enhance the structural fidelity of synthetic tabular data for high-stakes, multivariate clinical research applications.

Version published to 10.1101/2025.05.02.25326890 on medRxiv
May 3, 2025

