Evaluating the Quality of Synthetic Data in Health Care

Abstract

Machine learning (ML) research in health care remains challenging because large, privacy-preserving open datasets are lacking. Synthetic data could offer a solution, but its value depends on diverse and partly conflicting criteria such as utility, fidelity, and privacy, which are rarely evaluated comprehensively. To close this gap, we empirically explore the trade-offs between these criteria across a broad spectrum of generative models, datasets, and metrics. Our results indicate that no single generative model excels on all metrics and datasets; instead, a different generative method works best for each dataset, highlighting the need for automated selection of synthetic data methods. We further investigate the dependency between privacy and utility metrics and demonstrate that the first two principal variance directions of all metrics capture the trade-offs between fidelity, utility, and privacy well enough to support design choices for generative models in health care.
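The "first two main variance directions" of the metric scores correspond to a principal component analysis over the matrix of (generator, dataset) evaluations. A minimal sketch of that analysis is shown below; the metric values here are randomly generated placeholders, not results from the article, and the column names are illustrative assumptions.

```python
import numpy as np

# Hypothetical metric scores for 12 (generator, dataset) pairs.
# Columns: fidelity, utility, privacy -- illustrative only.
rng = np.random.default_rng(0)
scores = rng.uniform(0.5, 1.0, size=(12, 3))

# Standardize each metric so differing scales do not dominate the PCA.
X = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# PCA via SVD: the principal directions are the rows of Vt.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Fraction of total variance captured by the first two components.
top2 = float(explained[:2].sum())
print(f"variance explained by first two PCs: {top2:.2f}")

# 2-D coordinates of each (generator, dataset) pair in the PC plane,
# which is the plot one would inspect to read off the trade-offs.
coords = X @ Vt[:2].T
```

If the first two components explain most of the variance, the fidelity/utility/privacy trade-offs can be inspected in a single 2-D scatter of `coords`, which is the kind of summary the abstract argues is sufficient to support design choices.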