Synthetic Data Generation and Nonparametric Techniques for Assessing Multivariate Similarity to Address Small-Sample Size Challenges

John Heine
Erin Fowler
Steven Eschrich
Michael J. Schell

Abstract

Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offer a potential remedy for such shortfalls, provided that the generated data are an adequate facsimile of the underlying distribution; how to establish that adequacy remains an open problem. In this work, we evaluate a previously proposed synthetic data generator. To accommodate the small sample size, the generator applies a series of transformations to the observed data, yielding an uncoupled representation in which uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation.
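The general idea of an uncoupled representation can be illustrated with a minimal sketch. It is not the authors' generator: it assumes the decorrelating transformation is a plain principal-component rotation and uses SciPy's default (Scott) bandwidth in place of the optimized bandwidth mentioned in the abstract. The function name `generate_synthetic` and its parameters are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde


def generate_synthetic(X, n_synthetic, seed=0):
    """Hypothetical sketch: decorrelate, fit 1-D KDEs, resample, map back.

    Assumptions (not from the source): PCA rotation as the decorrelating
    transform, Scott's-rule bandwidth instead of an optimized one.
    """
    mean = X.mean(axis=0)
    Xc = X - mean

    # Rotate onto the eigenvectors of the sample covariance so that the
    # resulting score columns are mutually uncorrelated.
    cov = np.cov(Xc, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    scores = Xc @ eigvecs

    # Model each uncorrelated marginal with its own univariate KDE and
    # draw new values from it independently of the other marginals.
    synth_scores = np.empty((n_synthetic, X.shape[1]))
    for j in range(X.shape[1]):
        kde = gaussian_kde(scores[:, j])
        synth_scores[:, j] = kde.resample(n_synthetic, seed=seed + j)[0]

    # Undo the rotation and restore the mean.
    return synth_scores @ eigvecs.T + mean
```

Sampling each marginal independently is only justified when the uncorrelated components are also close to independent, which is the condition the report investigates for non-normal data.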

In this report, we (1) develop a nonparametric method for assessing multivariate similarity based on the Cramér-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so that each stage of the generation process could be analyzed explicitly.
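Why zero correlation and independence must be distinguished outside the normal setting can be seen in a standard textbook example, sketched here with illustrative variable names that do not come from the report. A symmetric uniform variable and its square are linearly uncorrelated, yet one is a deterministic function of the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# x is symmetric about zero and non-normal; y is fully determined by x.
x = rng.uniform(-1.0, 1.0, size=10_000)
y = x**2

# The linear (Pearson) correlation between x and y is close to zero ...
r_linear = np.corrcoef(x, y)[0, 1]

# ... but the dependence is exposed after a nonlinear transform of x.
r_after_transform = np.corrcoef(x**2, y)[0, 1]

print(f"corr(x, y)   = {r_linear:.3f}")
print(f"corr(x^2, y) = {r_after_transform:.3f}")
```

Sampling two such variables from separately fitted marginals would reproduce each margin but destroy the relationship between them.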

A formal testing framework was developed by comparing samples at the level of individual random projections with a two-sample test, modeling the test outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable, standardized normal test statistic. The key innovation was decoupling the significance level of the two-sample test from the level governing the final normal inference. We showed that the same projection framework also evaluates the full multivariate covariance structure. The generator produced high-fidelity multivariate synthetic data when the absence of bivariate correlation approximated independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.
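The pooling logic described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the two-sample test is taken to be Kolmogorov-Smirnov, a replicate is taken to be a random subsample of each projected dataset, and all trials are treated as independent Bernoulli outcomes, which is an idealisation. The function name `random_projection_z` and its parameters are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp, norm


def random_projection_z(X, Y, n_directions=100, n_replicates=10,
                        alpha=0.05, frac=0.5, seed=0):
    """Hypothetical sketch of a pooled random-projection similarity test.

    alpha is the per-trial two-sample level; the returned p-value can be
    judged against a separate level chosen for the final normal inference.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    nx, ny = int(frac * len(X)), int(frac * len(Y))
    rejections = 0

    for _ in range(n_directions):
        # Draw a direction uniformly on the unit sphere and project both sets.
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        px, py = X @ u, Y @ u

        for _ in range(n_replicates):
            # Assumed replicate scheme: subsample without replacement.
            sx = rng.choice(px, size=nx, replace=False)
            sy = rng.choice(py, size=ny, replace=False)
            # Each two-sample test outcome is one Bernoulli trial.
            if ks_2samp(sx, sy).pvalue < alpha:
                rejections += 1

    # Under the null, rejections are approximately Binomial(n_trials, alpha);
    # standardize the pooled count with the normal approximation.
    n_trials = n_directions * n_replicates
    z = (rejections - n_trials * alpha) / np.sqrt(n_trials * alpha * (1 - alpha))
    return z, norm.sf(z)  # one-sided: an excess of rejections signals dissimilarity
```

Because subsampled replicates within a direction share data, the independence assumption is optimistic, and a calibrated version would need to account for that correlation.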

Version published to 10.64898/2026.05.04.722226 on bioRxiv
May 7, 2026

Related articles

Synthetic-data augmented calibration for expert-informed rare disease models

This article has 9 authors:
1. Hanning Yang
2. Timo Rachel
3. Tim Litwin
4. Meropi Karakioulaki
5. Antonia Reimer-Taschenbrecker
6. Jens Timmer
7. Cristina Has
8. Harald Binder
9. Moritz Hess
This article has no evaluations. Latest version: May 20, 2026

Revisiting Reconstruction Likelihood: Variational Autoencoders for Biological and Biomedical Data Clustering

This article has 3 authors:
1. Andrej Korenić
2. Ufuk Özkaya
3. Abdulkerim Çapar
This article has no evaluations. Latest version: Apr 12, 2026

Robust Random Forests for Genomic Prediction: Challenges and Remedies

This article has 3 authors:
1. Vanda M. Lourenço
2. Joseph O. Ogutu
3. Hans-Peter Piepho
This article has no evaluations. Latest version: Apr 1, 2026
