Evaluating the Quality of Synthetic Data in Health Care

Abstract

Machine learning (ML) research in health care remains challenging because large, privacy-preserving open datasets are lacking. Synthetic data could offer a solution, but its value depends on diverse and partly conflicting criteria such as utility, fidelity, and privacy, which are rarely evaluated comprehensively. To close this gap, we empirically explore the trade-offs between these criteria across a broad spectrum of generative models, datasets, and metrics. Our results indicate that no single generative model excels on all metrics and datasets; instead, a different generative method works best for each dataset, highlighting the need for automated selection of synthetic data methods. We further investigate the dependency between privacy and utility metrics and demonstrate that the first two principal variance directions of all metrics capture the trade-offs between fidelity, utility, and privacy well enough to support design choices for generative models in health care.
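The "first two main variance directions" of the metric scores correspond to a principal component analysis over the matrix of (generator, dataset) evaluations. A minimal sketch of that analysis is shown below; the metric values here are randomly generated placeholders, not results from the article, and the column names are illustrative assumptions.

```python
import numpy as np

# Hypothetical metric scores for 12 (generator, dataset) pairs.
# Columns: fidelity, utility, privacy -- illustrative only.
rng = np.random.default_rng(0)
scores = rng.uniform(0.5, 1.0, size=(12, 3))

# Standardize each metric so differing scales do not dominate the PCA.
X = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# PCA via SVD: the principal directions are the rows of Vt.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Fraction of total variance captured by the first two components.
top2 = float(explained[:2].sum())
print(f"variance explained by first two PCs: {top2:.2f}")

# 2-D coordinates of each (generator, dataset) pair in the PC plane,
# which is the plot one would inspect to read off the trade-offs.
coords = X @ Vt[:2].T
```

If the first two components explain most of the variance, the fidelity/utility/privacy trade-offs can be inspected in a single 2-D scatter of `coords`, which is the kind of summary the abstract argues is sufficient to support design choices.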