Utility is all you need: fidelity-agnostic synthetic data generation

Abstract

Data synthesis has been popularized as a way to publish useful datasets in a privacy-aware manner, which makes it valuable across scientific domains involving human subjects, such as health, social, financial, and other applied sciences. Synthetic data is typically generated by sampling from models that mimic the probability distribution of a real dataset, thereby maximizing statistical similarity to the real data. However, we argue and demonstrate that synthetic data only needs to be similar in ways relevant to its intended use and may neglect irrelevant information, which in turn may improve privacy protection. We therefore address the tension between synthesizing data that is useful and data that protects privacy by proposing a new data synthesis method, Fidelity Agnostic Synthetic Data. The method first uses a neural network to extract the features relevant to the dataset's intended use, then generates synthetic versions of the extracted features, and finally decodes them to mimic the real dataset. We show that, compared to other state-of-the-art methods, our synthetic data improves performance in prediction tasks while retaining privacy protection. This result holds across datasets from a variety of scientific disciplines that benefit from privacy protection, further underscoring the potential of our method in human subject research.
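
The three stages described in the abstract (extract task-relevant features, synthesize those features, decode back to the data space) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the network architecture, the feature generator, or the decoder, so everything below is an assumption made for illustration: a small one-hidden-layer network trained on a regression target as the extractor, a multivariate Gaussian as the feature generator, and a least-squares linear map as the decoder. All function names and the toy data are hypothetical.

```python
import numpy as np


def fit_extractor(X, y, n_features=2, epochs=500, lr=0.05, seed=0):
    """Train a one-hidden-layer tanh network to predict y (assumed extractor).

    Only the hidden-layer parameters are returned: the hidden activations
    serve as the task-relevant features.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.5, size=(d, n_features))
    b1 = np.zeros(n_features)
    w2 = rng.normal(scale=0.5, size=n_features)
    b2 = 0.0
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # hidden features, shape (n, k)
        err = H @ w2 + b2 - y               # prediction error, shape (n,)
        # Backpropagate the squared-error loss through the tanh layer.
        dH = np.outer(err, w2) * (1.0 - H ** 2)
        W1 -= lr * (X.T @ dH) / n
        b1 -= lr * dH.mean(axis=0)
        w2 -= lr * (H.T @ err) / n
        b2 -= lr * err.mean()
    return W1, b1


def extract(X, W1, b1):
    """Map real records to the learned feature space."""
    return np.tanh(X @ W1 + b1)


def sample_features(H, n_samples, rng):
    """Draw synthetic features from a Gaussian fitted to H (assumed generator)."""
    mean = H.mean(axis=0)
    cov = np.atleast_2d(np.cov(H, rowvar=False))
    return rng.multivariate_normal(mean, cov, size=n_samples)


def fit_decoder(H, table):
    """Least-squares linear map from features back to the full table (assumed decoder)."""
    A = np.column_stack([H, np.ones(len(H))])
    coef, *_ = np.linalg.lstsq(A, table, rcond=None)
    return coef


def decode(H, coef):
    """Apply the fitted linear decoder to a batch of features."""
    return np.column_stack([H, np.ones(len(H))]) @ coef


# Toy data (hypothetical): only the first two of four columns drive the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

W1, b1 = fit_extractor(X, y)
H_real = extract(X, W1, b1)
coef = fit_decoder(H_real, np.column_stack([X, y]))
H_synth = sample_features(H_real, n_samples=100, rng=rng)
synthetic = decode(H_synth, coef)           # columns: 4 inputs plus the target
```

Because the decoder only sees features that were trained to predict the target, columns irrelevant to the prediction task are reproduced poorly in `synthetic`, which is the fidelity trade-off the abstract argues for. Whether a given generator and decoder actually protect privacy has to be evaluated separately; this sketch makes no such claim.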
