Evaluating Fidelity and Machine Learning Utility of Synthetic Tabular Data Generated Using Generative Models

Abstract

Synthetic tabular data offers a promising solution for enabling privacy-preserving machine learning in sensitive domains such as healthcare. However, assessing the fidelity and utility of such data remains challenging. In this study, we evaluate four generative models—CTGAN, TVAE, Gaussian Copula, and CopulaGAN—on a benchmark dataset for stroke prediction. We propose a two-phase generation and evaluation framework that combines statistical diagnostics with feature-level fidelity analysis and downstream classification performance. Our findings highlight significant variation across models, with TVAE and Gaussian Copula achieving superior fidelity and generalization. The results demonstrate that high structural similarity does not always guarantee practical machine learning utility.
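To make the idea of feature-level fidelity concrete, the sketch below scores each column by comparing its real and synthetic distributions. The abstract does not name the metrics used in the study, so the choice of the two-sample Kolmogorov-Smirnov statistic here is an assumption, as are the function names, the column names, and the toy values.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    if not real_sorted or not synth_sorted:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The gap between the two step functions only changes at observed values.
    for value in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, value) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


def feature_fidelity(real_table, synthetic_table):
    """Per-column fidelity score (1 - KS) for columns present in both tables.
    Each table is a dict mapping a column name to a list of numeric values."""
    shared = real_table.keys() & synthetic_table.keys()
    return {
        col: 1.0 - ks_statistic(real_table[col], synthetic_table[col])
        for col in sorted(shared)
    }


# Hypothetical toy data, not taken from the study.
real = {"age": [40, 50, 60, 70], "glucose": [80.0, 90.0, 100.0, 110.0]}
synthetic = {"age": [40, 50, 60, 70], "glucose": [200.0, 210.0, 220.0, 230.0]}

scores = feature_fidelity(real, synthetic)
print(scores)  # {'age': 1.0, 'glucose': 0.0}
```

A per-column score of this kind covers only numeric marginals. It says nothing about relationships between columns or about downstream classification performance, which is why the abstract treats utility as a separate question from structural similarity.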
