Privacy-by-design generation of two virtual clinical trials in multiple sclerosis and their release as open datasets
Abstract
Sharing individual patient data is restricted by regulatory frameworks because of privacy concerns. Generative artificial intelligence could produce shareable virtual patient populations that serve as proxies for sensitive reference datasets, but privacy must be demonstrated explicitly. Here, we determined whether a privacy-by-design technique called “avatars” can generate synthetic randomized clinical trials (RCTs). We generated 2160 synthetic datasets from two RCTs in multiple sclerosis (NCT00213135 and NCT00906399) under different configurations, in order to select, for each RCT, one synthetic dataset with optimal privacy and utility. Several privacy metrics were computed, including protection against distance-based membership inference attacks. We assessed utility by comparing variable distributions and by checking that every endpoint reported in the publications had the same effect direction, fell within the reported 95% confidence interval, and had the same statistical significance. Protection against membership inference attacks was the hardest privacy metric to optimize, but the technique yielded robust privacy and replication of the primary endpoints. With optimized generation configurations, we could select one dataset from each RCT that replicated all efficacy endpoints of the placebo and commercial treatment arms with satisfactory privacy. To show the potential to unlock health data sharing, we released both placebo arms as open datasets.
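
The three utility criteria applied to each endpoint (same effect direction, inside the reported 95% confidence interval, same statistical significance) can be illustrated with a minimal Python sketch. It is not the authors' code. The `Endpoint` class, the `replicates` function, the 0.05 significance threshold, and the reading of "within the reported 95% confidence interval" as "the synthetic point estimate lies inside the published interval" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Summary of one efficacy endpoint (hypothetical structure)."""
    estimate: float  # point estimate of the effect, e.g. a mean difference
    ci_low: float    # lower bound of the 95% confidence interval
    ci_high: float   # upper bound of the 95% confidence interval
    p_value: float   # p-value of the corresponding test


def replicates(reported: Endpoint, synthetic: Endpoint,
               null_value: float = 0.0, alpha: float = 0.05) -> bool:
    """Return True if the synthetic endpoint meets all three criteria.

    null_value is the "no effect" value: 0.0 for differences,
    1.0 for ratio measures such as hazard ratios (assumption).
    """
    # 1. Same effect direction: both estimates lie on the same side of the
    #    null value (an estimate exactly at the null counts as a mismatch).
    same_direction = (
        (reported.estimate - null_value) * (synthetic.estimate - null_value) > 0
    )
    # 2. The synthetic estimate lies inside the published 95% interval.
    within_ci = reported.ci_low <= synthetic.estimate <= reported.ci_high
    # 3. Both results fall on the same side of the significance threshold.
    same_significance = (reported.p_value < alpha) == (synthetic.p_value < alpha)
    return same_direction and within_ci and same_significance
```

With made-up numbers, `replicates(Endpoint(-0.30, -0.55, -0.05, 0.02), Endpoint(-0.25, -0.50, 0.00, 0.04))` passes all three checks, whereas the same synthetic endpoint with a p-value of 0.08 fails the significance criterion.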