Synthetic RNA-seq cohorts for data sharing: a discovery-aware benchmark at transcriptome scale


Abstract

Background

Sharing patient-level gene expression data is essential for translational discovery but carries documented re-identification risks. Bulk RNA-seq count matrices can retain genotypic signals, and paired clinical metadata compounds this risk through quasi-identifier matching. Synthetic RNA-seq cohorts offer a complementary path for privacy-preserving data sharing, but the field lacks a multi-axis benchmark that probes both biological fidelity and empirical privacy risk at transcriptome scale. Here we present a multi-axis benchmark framework that reflects how transcriptomic cohorts are used in translational practice.

Methods

We benchmarked three generative models across four cohorts drawn from datasets spanning oncology (TCGA-LUAD), sepsis (GSE184900), and pediatric IBD (RISK/GSE57945): dbTwin (a non-deep-learning, target-conditioned method that operates natively at RNA-seq scale), class-MVN (a low-rank, target-conditioned multivariate Gaussian model), and PCA-CTGAN (a tabular GAN trained in PCA-compressed space). Synthetic cohorts were generated from the training folds of a five-fold stratified design. We evaluated recovery of differentially expressed (DE) genes, concordance of log2 fold change (log2FC) and of significance (padj), held-out AUC under train-on-synthetic, test-on-real (TSTR) evaluation, SHAP concordance, and distance-based memorization risk.
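As a rough illustration of two of the metrics named above, the following sketch computes DE gene recovery and log2FC concordance from precomputed results. It assumes that recovery is the fraction of real DE genes also called DE in the synthetic cohort and that concordance is a Pearson correlation over shared genes; the abstract does not spell out the exact definitions, and the function names, gene labels, and values here are hypothetical toy inputs, not benchmark data.

```python
from math import sqrt


def de_recovery(real_de, synthetic_de):
    """Fraction of real DE genes that are also called DE in the synthetic cohort.

    Assumed definition: |real ∩ synthetic| / |real|.
    """
    real_de, synthetic_de = set(real_de), set(synthetic_de)
    if not real_de:
        raise ValueError("real_de must contain at least one gene")
    return len(real_de & synthetic_de) / len(real_de)


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def log2fc_concordance(real_lfc, synthetic_lfc):
    """Pearson r of log2FC over genes present in both result tables."""
    genes = sorted(set(real_lfc) & set(synthetic_lfc))
    return pearson_r([real_lfc[g] for g in genes],
                     [synthetic_lfc[g] for g in genes])


# Hypothetical toy inputs (gene -> log2FC, and sets of significant genes).
real_lfc = {"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": 0.5, "GENE_D": 0.0}
syn_lfc = {"GENE_A": 4.0, "GENE_B": -2.0, "GENE_C": 1.0, "GENE_D": 0.0}
real_de = {"GENE_A", "GENE_B", "GENE_C", "GENE_E"}
syn_de = {"GENE_A", "GENE_B", "GENE_C", "GENE_F"}

recovery = de_recovery(real_de, syn_de)        # 3 of 4 real DE genes recovered
lfc_r = log2fc_concordance(real_lfc, syn_lfc)  # synthetic log2FC is 2x real
```

Recovery defined this way does not penalize extra DE calls in the synthetic cohort, so it is worth reporting alongside the synthetic DE gene count.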

Results

class-MVN recovered 64.8% and 43.1% of real DE genes in the two binary cohorts with high fold-change correlation but lower significance concordance (r = 0.24–0.68) and inflated DE gene counts. dbTwin recovered 78.7% and 91.8% of real DE genes in the same cohorts, with high fold-change correlation and stronger significance concordance (r ≥ 0.88). Both methods matched held-out real AUC under TSTR, but SHAP agreement differed substantially: dbTwin preserved feature attribution patterns across cohorts (SHAP top-50 genes r = 0.84–0.99 across two binary and two multiclass cohorts), whereas class-MVN showed moderate performance for majority classes but degraded in multiclass and imbalanced settings (SHAP r = 0.31–0.79). PCA-CTGAN performed poorly across most DE and ML metrics. Distance-to-closest-record analysis did not indicate memorization by any of the models.
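A distance-to-closest-record check can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: it uses Euclidean distance on toy two-dimensional points, and it compares synthetic-to-train distances against held-out-real-to-train distances as a baseline, which is a common heuristic that the abstract does not confirm. All names and values are hypothetical.

```python
from math import dist
from statistics import median


def dcr(queries, reference):
    """Distance from each query sample to its nearest reference sample."""
    return [min(dist(q, r) for r in reference) for q in queries]


# Hypothetical toy samples; real use would pass expression profiles
# (for example, normalized counts or principal components).
train = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
holdout = [(3.0, 4.0), (10.0, 5.0)]
synthetic = [(6.0, 8.0), (7.0, 4.0)]

dcr_holdout = dcr(holdout, train)      # baseline: real unseen samples vs. train
dcr_synthetic = dcr(synthetic, train)  # synthetic samples vs. train

# Assumed heuristic: synthetic records sitting much closer to the training
# data than held-out real records do would hint at memorization.
ratio = median(dcr_synthetic) / median(dcr_holdout)
possible_memorization = ratio < 1.0
```

An exact copy of a training record yields a distance of 0, so a cluster of near-zero synthetic distances is the pattern such a check is meant to surface.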

Conclusions

We introduced a multi-axis, transcriptome-scale, discovery-aware benchmark to validate synthetic RNA-seq cohorts for translational workflows and evaluated three generative models across four real-world cohorts. These results support the use of synthetic RNA-seq cohorts for exploratory analysis and method development, while emphasizing the need for careful validation before use in higher-stakes applications. All benchmark code and data are available at https://github.com/Nanda-Aditya/rna-syn-bench.
