Toward Reliable Synthetic Omics: Statistical Distances for Generative Models Evaluation
Abstract
Background
Synthetic data generation is emerging as an approach to overcome the scarcity of real-world data in omics studies, especially in precision medicine and oncology. Omics datasets, with their high dimensionality and relatively small sample sizes, often lead to overfitting, particularly in deep learning models. Generative models offer a promising way to produce realistic synthetic data that preserves the original data distribution. However, there is still no objective consensus on how to evaluate their performance. In this study, we set out to validate generative networks for transcriptomics data generation by using statistical distances as robust evaluation metrics.
Results
We observe that statistical distances enable simultaneous evaluation of the global and local fidelity of generated synthetic data. Because these distances satisfy the properties of true metrics, they also support formal hypothesis testing to assess whether a generative model has in fact converged to the reference distribution or is merely approaching it. Crucially, we found that optimizing for these distances implicitly selects models that maximize other widely used metrics of generative performance, evidence of their broad applicability. Overall, our findings indicate that adopting these metrics can play a key role in guiding the development of generative models across a wide range of domains.
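As an illustrative sketch of the kind of evaluation described above, the following code computes the multivariate energy distance (one example of a statistical distance that satisfies the metric axioms; the abstract does not specify which distances the study uses) between a reference sample and a synthetic sample, and runs a permutation test of the null hypothesis that the two samples come from the same distribution. All function names and parameters here are illustrative, not taken from the study.

```python
import numpy as np

def energy_distance(x, y):
    """Multivariate energy distance between samples x (n, d) and y (m, d).

    E(x, y) = 2*E||X - Y|| - E||X - X'|| - E||Y - Y'||, which is zero
    iff the two distributions coincide, so it is a true metric.
    """
    def mean_pdist(a, b):
        # mean pairwise Euclidean distance between rows of a and rows of b
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()
    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

def permutation_test(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0: x and y are drawn from the same distribution.

    Pools both samples, reshuffles the labels n_perm times, and counts how
    often the reshuffled distance is at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = energy_distance(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        d = energy_distance(pooled[perm[:n]], pooled[perm[n:]])
        exceed += d >= observed
    # add-one correction keeps the p-value strictly positive
    return observed, (exceed + 1) / (n_perm + 1)

# Toy usage: "real" data vs. a well-matched and a poorly-matched generator
rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, (50, 5))
good_synth = rng.normal(0.0, 1.0, (50, 5))   # same distribution
bad_synth = rng.normal(3.0, 1.0, (50, 5))    # shifted distribution
```

A small distance with a large p-value is consistent with convergence; a large distance with a small p-value indicates the generator is still only approaching the reference distribution. The permutation test is what makes this a formal hypothesis test rather than an informal score comparison.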