SynthCraft: an AI partner for synthetic data generation to support data access and augmentation in healthcare

Thomas Callender
Anders Boyd
Robert Davis
Silas Ruhrberg Estevez
Juan Lavista Ferres
Mihaela van der Schaar

Abstract

Background

Access to high-quality data is the foundation of biomedical research. However, access is often restricted by privacy constraints, whilst the data themselves may be unrepresentative or sparse. Synthetic data can support privacy-preserving data access, data augmentation, and complex analytical workflows, such as developing digital twins or evaluating the impact of data distribution shifts. Yet the use of synthetic data remains limited by the complexity of the methods and their evaluation, and by the need for advanced programming skills.

Methods

We developed SynthCraft, a tool for AI-human collaboration that supports the principled, transparent use of state-of-the-art synthetic data generation methods. SynthCraft combines Large Language Models (LLMs) with a reinforcement learning-based reasoning engine to orchestrate the workflow needed to generate synthetic data, guided by dynamic natural-language interaction with the user. We demonstrate the capability of SynthCraft on both tabular and genomic datasets: the National Health and Nutrition Examination Survey (NHANES) and The Cancer Genome Atlas (TCGA).

Results

Using SynthCraft, we analysed the privacy, statistical fidelity, and downstream utility of four synthetic data generators, with and without explicit privacy-preserving designs, applied to both the NHANES and TCGA datasets. We show that generators perform differently across use cases and datasets, and that no single method was optimal. Furthermore, we demonstrate how SynthCraft can be used for data augmentation as part of a workflow that attempts to mitigate imbalances in the proportion of individuals from different ethnic backgrounds.
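To make the evaluation dimensions concrete, the following is a minimal sketch of two commonly used proxies: total variation distance between the real and synthetic distributions of one categorical column (statistical fidelity) and distance to the closest real record (a rough privacy indicator). The abstract does not specify which metrics SynthCraft computes, so the function names, metric choices, and toy data here are illustrative assumptions only.

```python
from collections import Counter
import math


def total_variation_distance(real_values, synthetic_values):
    """Fidelity proxy (assumed metric): total variation distance between the
    empirical distributions of one categorical column.
    0 means identical marginals, 1 means disjoint support."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    n_real, n_synth = len(real_values), len(synthetic_values)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


def distance_to_closest_record(real_rows, synthetic_rows):
    """Privacy proxy (assumed metric): for each synthetic row, the Euclidean
    distance to its nearest real row. Values near 0 suggest a synthetic row
    may be a near-copy of a real individual."""
    return [
        min(math.dist(synth, real) for real in real_rows)
        for synth in synthetic_rows
    ]


# Hypothetical toy data, not drawn from NHANES or TCGA.
real_group = ["A", "A", "A", "B"]
synthetic_group = ["A", "A", "B", "B"]
fidelity_gap = total_variation_distance(real_group, synthetic_group)

real_rows = [(0.0, 0.0), (1.0, 1.0)]
synthetic_rows = [(0.0, 0.0), (4.0, 5.0)]
dcr = distance_to_closest_record(real_rows, synthetic_rows)

print(f"TVD on group column: {fidelity_gap:.2f}")
print(f"Distance to closest record: {dcr}")
```

In this toy case the first synthetic row exactly duplicates a real row, which is the kind of pattern a privacy check would flag, whilst the shifted group proportions show up as a non-zero fidelity gap. Downstream utility would need a separate check, for example training a model on synthetic data and testing it on held-out real data.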

Conclusions

An LLM-based, human-in-the-loop AI partner can support the generation of synthetic datasets. Such tools could improve the quality, reproducibility, and transparency of research methods, whilst increasing their accessibility. Research into their use across different methodological areas is warranted.

Version published to 10.1101/2025.08.17.25333866 on medRxiv
Aug 19, 2025

Exploring the Role of Synthetic Data in the Future of AI in Healthcare: A Scoping Review of Frameworks, Challenges, and Implications

This article has 4 authors:
1. Mohammad Ishtiaque Rahman
2. Razuan Hossain
3. S.M. Sayem
4. Forhan Bin Emdad
This article has no evaluations
Latest version Aug 5, 2025

On the use of variational autoencoders for biomedical data integration

This article has 4 authors:
1. Marc Pielies Avellí
2. Ricardo Hernández Medina
3. Henry Webel
4. Simon Rasmussen
This article has no evaluations
Latest version Aug 22, 2025

Coherent Cross-modal Generation of Synthetic Biomedical Data to Advance Multimodal Precision Medicine

This article has 10 authors:
1. Raffaele Marchesi
2. Nicolò Lazzaro
3. Walter Endrizzi
4. Gianluca Leonardi
5. Matteo Pozzi
6. Flavio Ragni
7. Stefano Bovo
8. Monica Moroni
9. Venet Osmani
10. Giuseppe Jurman
This article has no evaluations
Latest version Aug 27, 2025
