Evaluating Fidelity and Machine Learning Utility of Synthetic Tabular Data Generated Using Generative Models
Abstract
Synthetic tabular data offers a promising solution for enabling privacy-preserving machine learning in sensitive domains such as healthcare. However, assessing the fidelity and utility of such data remains challenging. In this study, we evaluate four generative models—CTGAN, TVAE, Gaussian Copula, and CopulaGAN—on a benchmark dataset for stroke prediction. We propose a two-phase generation and evaluation framework that combines statistical diagnostics with feature-level fidelity analysis and downstream classification performance. Our findings highlight significant variation across models, with TVAE and Gaussian Copula achieving superior fidelity and generalization. The results demonstrate that high structural similarity does not always guarantee practical machine learning utility.
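To make the two evaluation axes named above concrete, the following is a minimal sketch, not the authors' pipeline. It assumes feature-level fidelity is scored as one minus the two-sample Kolmogorov-Smirnov statistic per numeric column, and that utility is measured by training on synthetic data and testing on real data. The nearest-centroid classifier, the function names, and the toy data are all hypothetical stand-ins chosen to keep the example dependency-light; the abstract does not specify the metrics or classifiers used in the study.

```python
import numpy as np


def ks_statistic(real_col, synth_col):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    real_sorted = np.sort(real_col)
    synth_sorted = np.sort(synth_col)
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([real_sorted, synth_sorted])
    cdf_real = np.searchsorted(real_sorted, grid, side="right") / real_sorted.size
    cdf_synth = np.searchsorted(synth_sorted, grid, side="right") / synth_sorted.size
    return float(np.max(np.abs(cdf_real - cdf_synth)))


def feature_fidelity(real_X, synth_X):
    """Per-feature score in [0, 1]; 1 means the empirical marginals coincide."""
    n_features = real_X.shape[1]
    return np.array(
        [1.0 - ks_statistic(real_X[:, j], synth_X[:, j]) for j in range(n_features)]
    )


def fit_centroids(X, y):
    """Hypothetical stand-in classifier: one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_centroids(model, X):
    """Assign each row to the class with the nearest centroid."""
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]


def tstr_accuracy(synth_X, synth_y, real_X, real_y):
    """Train on synthetic, test on real: a simple downstream utility measure."""
    model = fit_centroids(synth_X, synth_y)
    return float(np.mean(predict_centroids(model, real_X) == real_y))


# Toy data (assumption): two well-separated classes with two numeric features.
rng = np.random.default_rng(0)
real_X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
real_y = np.repeat([0, 1], 200)

# Toy "synthetic" set (assumption): the real rows plus small Gaussian noise.
synth_X = real_X + rng.normal(0, 0.1, real_X.shape)
synth_y = real_y.copy()

fidelity = feature_fidelity(real_X, synth_X)
utility = tstr_accuracy(synth_X, synth_y, real_X, real_y)
print("per-feature fidelity:", fidelity)
print("TSTR accuracy:", utility)
```

Reporting the two numbers side by side is the point of the sketch: a generator can score well on per-feature marginals while losing the feature-label relationships a classifier depends on, so neither measure substitutes for the other.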