SynthCraft: an AI partner for synthetic data generation to support data access and augmentation in healthcare


Abstract

Background

Access to high-quality data provides the foundation for biomedical research. However, data access is often limited or challenging due to privacy constraints, whilst the data themselves may be unrepresentative or sparse. Synthetic data can support privacy-preserving data access and data augmentation, as well as complex analytical workflows such as developing digital twins or evaluating the impact of data distribution shifts. Nevertheless, the use of synthetic data remains limited due to the complexity of the methods themselves and of their evaluation, as well as the need for advanced programming skills.

Methods

We developed SynthCraft, a tool for AI-human collaboration to support the principled, transparent use of state-of-the-art synthetic data generation methods. SynthCraft uses Large Language Models (LLMs) combined with a reinforcement learning-based reasoning engine to orchestrate the workflow needed to generate synthetic data, based on dynamic natural-language interaction with the user. We demonstrate the capability of SynthCraft with both tabular and genomic datasets: the National Health and Nutrition Examination Survey (NHANES) and The Cancer Genome Atlas (TCGA).

Results

Using SynthCraft, we analysed the privacy, statistical fidelity, and downstream utility of four different synthetic data generators, both with and without explicit privacy-preserving designs, when applied to the NHANES and TCGA datasets. We show that different generators perform differently across use cases and datasets, and that no single method was optimal. Furthermore, we demonstrate how SynthCraft can be used for data augmentation as part of a workflow that attempts to mitigate imbalances in the proportion of individuals from different ethnic backgrounds.
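
To make the evaluation axes above more concrete, the following is a minimal sketch of two simple checks that are commonly applied to synthetic data: a statistical fidelity measure (total variation distance between category proportions) and a crude privacy proxy (distance from each synthetic record to its closest real record). The abstract does not state which metrics SynthCraft computes, so the function names, metric choices, and toy values here are illustrative assumptions only, not the tool's method or API.

```python
from collections import Counter
import math

# Hypothetical helpers for illustration; not part of SynthCraft.

def total_variation_distance(real, synthetic):
    """Fidelity check: TVD between empirical category distributions.
    0 means identical proportions, 1 means no overlap."""
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)

def min_distance_to_real(synthetic_rows, real_rows):
    """Privacy proxy: Euclidean distance from each synthetic row to its
    nearest real row. A distance of 0 flags an exact copy of a real record."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]

# Toy categorical column (e.g. a group label) in real vs synthetic data.
real_groups = ["A", "A", "A", "B"]
synthetic_groups = ["A", "A", "B", "B"]
fidelity_gap = total_variation_distance(real_groups, synthetic_groups)

# Toy numeric records; the first synthetic row duplicates a real one.
real_rows = [(0.0, 0.0), (3.0, 4.0)]
synthetic_rows = [(0.0, 0.0), (6.0, 8.0)]
closest = min_distance_to_real(synthetic_rows, real_rows)

print(fidelity_gap, closest)
```

In this toy case the synthetic column over-represents group B relative to the real data, and one synthetic record sits at distance zero from a real record. This illustrates why fidelity and privacy need to be assessed together, and why a generator that does well on one axis may do poorly on another.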

Conclusions

An LLM-based, human-in-the-loop AI partner can support the generation of synthetic datasets. Such tools could improve the quality, reproducibility, and transparency of research methods, whilst increasing their accessibility. Research into their use across different methodological areas is warranted.
