A Quantitative Comparison of Structural and Distributional Properties of Synthetic Tabular Data in Parkinson’s Disease
Abstract
Background
Parkinson’s disease (PD) research relies heavily on patient data, but access is often limited by privacy concerns, data scarcity, and collection costs. Synthetic data generation offers a potential solution, provided its fidelity to real-world data is rigorously evaluated. This study quantitatively assesses the structural and distributional fidelity of synthetic tabular data designed to represent PD patients.
Methods
We compared a synthetically generated dataset (N=500 hypothetical entries) against an anonymized real-world dataset (N=57 PD patients) containing demographics, clinical scores (UPDRS, MoCA), and mobility data (6MWT-related variables). The evaluation focused on three key quantitative metrics: (1) Column Correlation Stability, measured by the average absolute difference between Pearson correlation matrices, assessed overall and for clinically relevant variable subgroups (6MWT, UPDRS, MoCA); (2) Principal Component Analysis (PCA), evaluating the variance captured by the top principal components in both datasets; and (3) Jensen-Shannon Distance (JSD), quantifying the distributional similarity between real and synthetic variables across different groups.
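As an illustration of metrics (1) and (3), the following is a minimal sketch and not the authors' implementation. The function names `mean_abs_corr_diff` and `js_distance` are hypothetical, and several choices are assumptions the abstract does not specify: only off-diagonal correlation entries are averaged, continuous variables are binned into 10 shared histogram bins before computing JSD, and base-2 logarithms are used so that the distance lies in [0, 1].

```python
import numpy as np

def mean_abs_corr_diff(real, synth):
    """Mean absolute difference between two Pearson correlation matrices.

    `real` and `synth` are 2-D arrays (rows = patients, columns = variables)
    with the same columns in the same order.
    """
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synth, rowvar=False)
    # Assumption: average over the off-diagonal upper triangle only,
    # because the diagonal is always 1 and would dilute the difference.
    upper = np.triu_indices_from(corr_real, k=1)
    return float(np.mean(np.abs(corr_real[upper] - corr_synth[upper])))

def js_distance(real_col, synth_col, bins=10):
    """Jensen-Shannon distance between two 1-D samples, via shared histograms."""
    real_col = np.asarray(real_col, dtype=float)
    synth_col = np.asarray(synth_col, dtype=float)
    # Assumption: both samples are binned on a common grid spanning their joint range.
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real_col, bins=edges)
    q, _ = np.histogram(synth_col, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence in bits, skipping empty bins of `a`.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # The distance is the square root of the divergence.
    return float(np.sqrt(max(divergence, 0.0)))
```

Under these assumptions, a subgroup score (for example, for the MoCA variables) would be obtained by passing only that subgroup's columns to `mean_abs_corr_diff`, or by averaging `js_distance` over the subgroup's columns. The resulting JSD values depend on the bin count, so they are comparable only when the binning is held fixed.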
Results
The overall average absolute correlation difference between the real and synthetic datasets was 0.049, indicating moderate preservation of pairwise variable relationships globally. However, stability varied across subgroups: the 6MWT group showed higher fidelity (difference ∼0.044) than the UPDRS (∼0.080) and MoCA (0.081) groups. PCA revealed that the first two principal components captured 21.36% and 16.36% of the variance, respectively, and visual analysis showed partial overlap between real and synthetic data clusters. Average JSD values indicated moderate distributional similarity overall: the MoCA group exhibited the highest fidelity (JSD = 0.0573), while the Demographics (0.1167), Clinical (0.1256), and 6MWT (0.1175) groups showed lower distributional similarity.
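The share of variance captured by each principal component, as reported above, can be computed along the following lines. This is a sketch under stated assumptions rather than the study's procedure: the helper name `explained_variance_ratio` is hypothetical, variables are standardized to unit variance before decomposition, and the abstract does not say whether PCA was fitted on the real data, the synthetic data, or both pooled.

```python
import numpy as np

def explained_variance_ratio(data, n_components=2):
    """Fraction of total variance captured by the leading principal components.

    `data` is a 2-D array (rows = patients, columns = variables).
    Assumption: columns are z-scored first, so no column may be constant.
    """
    x = np.asarray(data, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal axis.
    singular_values = np.linalg.svd(x, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()
```

Calling this helper separately on the real and synthetic tables and comparing the returned ratios gives one simple check of whether the two datasets concentrate variance in a similar way.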
Conclusion
Synthetic data generation techniques can replicate univariate distributional properties of PD patient data with moderate success, particularly for certain variable types like cognitive assessments (MoCA). However, accurately capturing the complex multivariate correlation structures, crucial for understanding symptom interactions and building predictive models, remains a significant challenge, especially within specific clinical domains like UPDRS. While synthetic data holds promise for addressing data access issues in PD research, particularly for tasks less sensitive to correlation structure, its application requires careful, context-specific validation. Further development is needed to enhance the structural fidelity of synthetic tabular data for high-stakes, multivariate clinical research applications.