Synthetic RNA-seq cohorts for data sharing: a discovery-aware benchmark at transcriptome scale

Aditya Nanda
Somdutta Saha


Abstract

Background

Sharing patient-level gene expression data is essential for translational discovery but carries documented re-identification risks. Bulk RNA-seq count matrices can retain genotypic signals, and paired clinical metadata compounds this risk through quasi-identifier matching. Synthetic RNA-seq cohorts offer a complementary path for privacy-preserving data sharing, but the field lacks a multi-axis benchmark that probes both biological fidelity and empirical privacy risk at transcriptome scale. Here we present a multi-axis benchmark framework that reflects how transcriptomic cohorts are used in translational practice.

Methods

We benchmarked three generative models across four cohorts drawn from datasets spanning oncology (TCGA-LUAD), sepsis (GSE184900), and pediatric IBD (RISK/GSE57945): dbTwin (a non-deep-learning, target-conditioned method that operates natively at RNA-seq scale), class-MVN (a low-rank target-conditioned multivariate Gaussian model), and PCA-CTGAN (a tabular GAN trained in PCA-compressed space). Synthetic cohorts were generated from the training folds of a five-fold stratified design. We evaluated recovery of differentially expressed (DE) genes, concordance of log₂FC and significance (padj), held-out AUC under train-on-synthetic, test-on-real (TSTR) evaluation, SHAP concordance, and distance-based memorization risk.
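As an illustration of the first two evaluation axes, the following is a minimal sketch of DE gene recovery and log₂FC concordance. It assumes that DE calls and per-gene log₂FC values have already been computed separately for the real and synthetic cohorts; the function names (`de_recovery`, `log2fc_concordance`) and the use of Pearson correlation over shared genes are assumptions made for this example, not the benchmark's own implementation.

```python
import math


def de_recovery(real_de, synth_de):
    """Fraction of real DE genes that are also called DE in the synthetic cohort."""
    real_de, synth_de = set(real_de), set(synth_de)
    if not real_de:
        raise ValueError("real DE gene set is empty")
    return len(real_de & synth_de) / len(real_de)


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log2fc_concordance(real_lfc, synth_lfc):
    """Pearson r of log2 fold changes over genes present in both dicts.

    Assumption: each argument maps gene name -> log2FC for one cohort.
    """
    genes = sorted(set(real_lfc) & set(synth_lfc))
    return pearson_r([real_lfc[g] for g in genes],
                     [synth_lfc[g] for g in genes])


# Toy example with made-up gene names and values (not data from the study).
recovered = de_recovery({"A", "B", "C", "D"}, {"A", "B", "X"})
r = log2fc_concordance({"A": 1.0, "B": 2.0, "C": 3.0},
                       {"A": 2.0, "B": 4.0, "C": 6.0})
```

In this sketch a synthetic cohort that inflates DE gene counts can still score well on recovery, so recovery is best read alongside the number of synthetic-only DE calls.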

Results

class-MVN recovered 64.8% and 43.1% of real DE genes in the two binary cohorts with high fold-change correlation but lower significance concordance (r = 0.24–0.68) and inflated DE gene counts. dbTwin recovered 78.7% and 91.8% of real DE genes in the same cohorts, with high fold-change correlation and stronger significance concordance (r ≥ 0.88). Both methods matched held-out real AUC under TSTR, but SHAP agreement differed substantially: dbTwin preserved feature attribution patterns across cohorts (SHAP top-50 genes r = 0.84–0.99 across two binary and two multiclass cohorts), whereas class-MVN showed moderate performance for majority classes but degraded in multiclass and imbalanced settings (SHAP r = 0.31–0.79). PCA-CTGAN performed poorly across most DE and ML metrics. Distance-to-closest-record analysis did not indicate memorization by any of the models.
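A distance-to-closest-record check of the kind mentioned above can be sketched as follows. The abstract does not state the distance metric or decision rule, so this example assumes Euclidean distance and a simple comparison of medians against a held-out real baseline; `dcr` and `memorization_flag` are hypothetical names introduced here.

```python
import math
from statistics import median


def dcr(queries, reference):
    """For each query vector, Euclidean distance to its nearest reference vector."""
    return [min(math.dist(q, r) for r in reference) for q in queries]


def memorization_flag(synthetic, train, holdout):
    """Compare synthetic-to-train distances with a holdout-to-train baseline.

    Assumed rule: flag possible memorization when synthetic samples sit
    closer to the training records (by median DCR) than held-out real
    samples do.
    """
    synth_dcr = median(dcr(synthetic, train))
    base_dcr = median(dcr(holdout, train))
    return synth_dcr < base_dcr, synth_dcr, base_dcr


# Toy 2-D vectors standing in for expression profiles (not study data).
train = [(0.0, 0.0), (10.0, 0.0)]
holdout = [(0.0, 3.0), (10.0, 4.0)]
synthetic = [(0.0, 5.0), (10.0, 5.0)]

flagged, synth_med, base_med = memorization_flag(synthetic, train, holdout)
copied_flag, _, _ = memorization_flag(train, train, holdout)
```

At transcriptome scale the same idea would usually be applied to a normalized or reduced-dimension representation rather than raw counts, which is a design choice this sketch leaves open.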

Conclusions

We introduced a multi-axis, transcriptome-scale, discovery-aware benchmark to validate synthetic RNA-seq cohorts for translational workflows and evaluated three generative models across four real-world cohorts. These results support the use of synthetic RNA-seq cohorts for exploratory analysis and method development, while emphasizing the need for careful validation before use in higher-stakes applications. All benchmark code and data are available at https://github.com/Nanda-Aditya/rna-syn-bench.

Version published to 10.64898/2026.05.22.726357 on bioRxiv
May 26, 2026

