Synthetic RNA-seq cohorts for data sharing: a discovery-aware benchmark at transcriptome scale
Abstract
Background
Sharing patient-level gene expression data is essential for translational discovery but carries documented re-identification risks. Bulk RNA-seq count matrices can retain genotypic signals, and paired clinical metadata compounds this risk through quasi-identifier matching. Synthetic RNA-seq cohorts offer a complementary path for privacy-preserving data sharing, but the field lacks a benchmark that probes both biological fidelity and empirical privacy risk at transcriptome scale. Here we present a multi-axis benchmark framework that reflects how transcriptomic cohorts are used in translational practice.
Methods
We benchmarked three generative models across four cohorts drawn from datasets spanning oncology (TCGA-LUAD), sepsis (GSE184900), and pediatric IBD (RISK/GSE57945): dbTwin (a non-deep-learning, target-conditioned method that operates natively at RNA-seq scale), class-MVN (a low-rank, target-conditioned multivariate Gaussian model), and PCA-CTGAN (a tabular GAN trained in PCA-compressed space). Synthetic cohorts were generated from the training folds of a five-fold stratified design. We evaluated recovery of differentially expressed (DE) genes, concordance of log2 fold change (log2FC) and significance (padj), held-out AUC under train-on-synthetic, test-on-real (TSTR) evaluation, SHAP concordance, and distance-based memorization risk.
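The DE-based metrics can be illustrated with a minimal sketch. This is not the benchmark's own code: it assumes that per-gene padj and log2FC values have already been computed separately on the real and synthetic cohorts and aligned by gene. The function names, the 0.05 threshold, and the choice of Pearson correlation on -log10(padj) are illustrative assumptions.

```python
import numpy as np

def de_recovery(real_padj, syn_padj, alpha=0.05):
    """Fraction of real DE genes (padj < alpha) that are also DE in the synthetic cohort.

    The threshold alpha is an assumption; the source does not state one.
    """
    real_de = np.asarray(real_padj) < alpha
    syn_de = np.asarray(syn_padj) < alpha
    if real_de.sum() == 0:
        return float("nan")  # no real DE genes, so recovery is undefined
    return float((real_de & syn_de).sum() / real_de.sum())

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def significance_concordance(real_padj, syn_padj, floor=1e-300):
    """Pearson r between -log10(padj) of real and synthetic results (assumed scale)."""
    a = -np.log10(np.clip(real_padj, floor, 1.0))
    b = -np.log10(np.clip(syn_padj, floor, 1.0))
    return pearson(a, b)

# Toy values for five genes, invented purely for illustration.
real_padj = np.array([0.001, 0.01, 0.2, 0.04, 0.9])
syn_padj = np.array([0.002, 0.2, 0.03, 0.01, 0.8])
real_lfc = np.array([2.1, -1.5, 0.2, 1.0, 0.0])
syn_lfc = np.array([1.9, -0.4, 0.9, 1.2, 0.1])

recovery = de_recovery(real_padj, syn_padj)      # 2 of the 3 real DE genes are recovered
lfc_r = pearson(real_lfc, syn_lfc)               # fold-change concordance
sig_r = significance_concordance(real_padj, syn_padj)
print(f"recovery={recovery:.3f} log2FC r={lfc_r:.3f} significance r={sig_r:.3f}")
```

Recovery alone does not penalize inflated DE calls in the synthetic cohort, which is why it is reported alongside concordance measures and DE gene counts.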
Results
class-MVN recovered 64.8% and 43.1% of real DE genes in the two binary cohorts with high fold-change correlation but lower significance concordance (r = 0.24–0.68) and inflated DE gene counts. dbTwin recovered 78.7% and 91.8% of real DE genes in the same cohorts, with high fold-change correlation and stronger significance concordance (r ≥ 0.88). Both methods matched held-out real AUC under TSTR, but SHAP agreement differed substantially: dbTwin preserved feature attribution patterns across cohorts (SHAP top-50 genes r = 0.84–0.99 across two binary and two multiclass cohorts), whereas class-MVN showed moderate performance for majority classes but degraded in multiclass and imbalanced settings (SHAP r = 0.31–0.79). PCA-CTGAN performed poorly across most DE and ML metrics. Distance-to-closest-record analysis did not indicate memorization by any of the models.
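A distance-to-closest-record (DCR) check can be sketched as follows. The abstract does not state the distance metric, feature space, or decision criterion used, so the Euclidean distance, the held-out baseline, and the names `dcr` and `dcr_summary` are assumptions made for illustration.

```python
import numpy as np

def dcr(queries, reference):
    """Euclidean distance from each query row to its nearest reference row."""
    queries = np.asarray(queries, float)
    reference = np.asarray(reference, float)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    q2 = (queries ** 2).sum(axis=1)[:, None]
    r2 = (reference ** 2).sum(axis=1)[None, :]
    d2 = np.maximum(q2 + r2 - 2.0 * (queries @ reference.T), 0.0)  # clamp rounding error
    return np.sqrt(d2.min(axis=1))

def dcr_summary(train, holdout, synthetic):
    """Median DCR of synthetic and held-out real samples against the training set.

    Assumed reading: a synthetic median far below the held-out median would
    suggest that synthetic samples sit unusually close to training records.
    """
    return {
        "synthetic_median": float(np.median(dcr(synthetic, train))),
        "holdout_median": float(np.median(dcr(holdout, train))),
    }

# Toy two-feature example, invented purely for illustration.
train = np.array([[0.0, 0.0], [3.0, 4.0]])
holdout = np.array([[0.0, 1.0]])
synthetic = np.array([[6.0, 8.0]])
print(dcr_summary(train, holdout, synthetic))
```

Comparing against held-out real samples gives the synthetic distances a reference scale. At transcriptome scale the distances would typically be computed on normalized or dimension-reduced expression values, which is a further choice the abstract leaves open.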
Conclusions
We introduced a multi-axis, transcriptome-scale, discovery-aware benchmark to validate synthetic RNA-seq cohorts for translational workflows and evaluated three generative models across four real-world cohorts. These results support the use of synthetic RNA-seq cohorts for exploratory analysis and method development, while emphasizing the need for careful validation before use in higher-stakes applications. All benchmark code and data are available at https://github.com/Nanda-Aditya/rna-syn-bench.