On the Fidelity versus Privacy and Utility Trade-Off of Synthetic Patient Data
Abstract
The use of synthetic data is a widely discussed and promising solution for privacy-preserving medical research. However, synthetic data are not always privacy-preserving, and synthetic datasets can vary greatly in fidelity and utility.
We systematically evaluate the trade-offs between privacy, fidelity, and utility across five synthetic data models and three patient-level datasets. We assess fidelity by statistical similarity to the real data, utility on three machine learning use cases, and privacy via membership inference, singling-out, and attribute-inference risks. Synthetic data generated without differential privacy (DP) maintained fidelity and utility without evident privacy breaches, whereas DP-enforced models significantly disrupted correlation structures. k-anonymity-based data sanitization, while preserving fidelity, introduced notable privacy risks. Our findings emphasize the need to advance methods that effectively balance privacy, fidelity, and utility in synthetic patient data generation.
Highlights
- Differential privacy (DP) had a detrimental effect on feature correlations
- Models that did not implement DP showed good fidelity compared to real data
- Non-DP synthetic models showed no strong evidence of privacy breaches
- k-anonymization produced high-fidelity data but showed notable privacy risks
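One common way to quantify the correlation-structure fidelity discussed above is to compare the pairwise feature correlation matrices of the real and synthetic data. The sketch below is a minimal, hypothetical illustration of this idea (not the paper's exact metric): it computes the mean absolute difference between Pearson correlation matrices, where a larger value corresponds to the kind of correlation disruption observed under DP. The feature names and simulated data are invented for the example.

```python
import numpy as np
import pandas as pd


def correlation_fidelity(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean absolute difference between the Pearson correlation matrices
    of real and synthetic data. Lower values mean the synthetic data
    better preserves the real correlation structure."""
    real_corr = real.corr().to_numpy()
    synth_corr = synthetic.corr().to_numpy()
    return float(np.abs(real_corr - synth_corr).mean())


# Toy example with two simulated, correlated "patient" features.
rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
real = pd.DataFrame(
    rng.multivariate_normal(mean=[0, 0], cov=cov, size=500),
    columns=["age", "blood_pressure"],
)
# A faithful synthetic set preserves the correlation ...
good_synth = pd.DataFrame(
    rng.multivariate_normal(mean=[0, 0], cov=cov, size=500),
    columns=["age", "blood_pressure"],
)
# ... while an over-noised one (as under strong DP) loses it.
noisy_synth = pd.DataFrame(
    rng.normal(size=(500, 2)), columns=["age", "blood_pressure"]
)

print(correlation_fidelity(real, good_synth))   # small difference
print(correlation_fidelity(real, noisy_synth))  # larger difference
```

In practice such a score would be computed per synthetic-data model and reported alongside utility and privacy metrics, so the three axes of the trade-off can be compared on the same datasets.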