A synthetic data generation pipeline to reproducibly mirror high-resolution multi-variable peptidomics and real-patient clinical data

Mayra Alejandra Jaimes Campos
Stipe Kabić
Agnieszka Latosinska
Ena Anicic
Justyna Siwy
Vinko Dragušica
Harald Rupprecht
Lorenzo Catanese
Felix Keller
Paul Perco
Enrique Gomez-Gomez
Joachim Beige
Antonia Vlahou
Harald Mischak
Davorin Vukelić
Tomislav Križan
Maria Frantzi


Abstract

Generating high-quality, real-world clinical and molecular datasets is challenging, costly and time-intensive. Such data should therefore be shared with the scientific community, but sharing carries the risk of privacy breaches. This risk limits the community's ability to freely share and access high-resolution, high-quality data, which are essential in the context of personalised medicine. In this study, we present an algorithm based on Gaussian copulas that generates synthetic data while retaining the associations within high-dimensional (peptidomics) datasets. For this purpose, 3,881 datasets from 10 cohorts were employed, containing clinical, demographic and molecular (> 21,500 peptides) variables, as well as outcome data for individuals with a kidney or a heart failure event. High-dimensional copulas were developed to capture the joint distribution of the clinical and peptidomics data in the dataset, and based on these distributions a data matrix of 2,000 synthetic patients was generated. The synthetic data maintained the capacity to reproducibly correlate the peptidomics data with the clinical variables. The rho-values for the correlation of individual peptides with eGFR were highly similar between the synthetic and the real-patient datasets, both at the single-peptide level (rho = 0.885, p < 2.2e-308) and after classification with machine learning models (rho_synthetic = -0.394, p = 5.21e-127; rho_real = -0.396, p = 4.64e-67). External validation was performed using independent multi-centric datasets (n = 2,964) of individuals with chronic kidney disease (CKD, defined as eGFR < 60 mL/min/1.73 m²) or with normal kidney function (eGFR > 90 mL/min/1.73 m²). Similarly, the rho-values of single peptides with eGFR in the synthetic data were significantly reproduced in the external validation datasets (rho = 0.569, p = 1.8e-218). Classifiers subsequently developed on the synthetic data matrices were highly predictive in external real-patient datasets (AUC values of 0.803 and 0.867 for heart failure (HF) and CKD, respectively), demonstrating the robustness of the method for generating synthetic patient data. The proposed pipeline represents a solution for sharing high-dimensional data while maintaining patient confidentiality.
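The general idea of a Gaussian copula generator can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not describe the implementation, so the empirical-rank marginals, the function names (`sample_gaussian_copula`, `normal_scores`, and so on) and the toy "eGFR" and "peptide" columns below are all assumptions made for illustration. The sketch uses only the Python standard library; a real pipeline with more than 21,500 variables would need numerical libraries and a more careful correlation estimate.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal used for the copula transforms

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """1-based ranks; ties are broken by position (fine for continuous toy data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def normal_scores(values):
    """Map a column to standard-normal scores through its empirical ranks."""
    n = len(values)
    return [STD_NORMAL.inv_cdf(r / (n + 1)) for r in ranks(values)]

def cholesky(m):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (m[i][i] - s) ** 0.5 if i == j else (m[i][j] - s) / low[j][j]
    return low

def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF with linear interpolation, for u in [0, 1]."""
    pos = u * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def sample_gaussian_copula(columns, n_synthetic, rng):
    """Draw synthetic rows mimicking the marginals and rank dependence of `columns`."""
    d = len(columns)
    scores = [normal_scores(c) for c in columns]
    corr = [[1.0 if i == j else pearson(scores[i], scores[j]) for j in range(d)]
            for i in range(d)]
    low = cholesky(corr)
    sorted_cols = [sorted(c) for c in columns]
    rows = []
    for _ in range(n_synthetic):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Correlate the independent normals, then map back to each marginal.
        correlated = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        rows.append([empirical_quantile(sorted_cols[i], STD_NORMAL.cdf(correlated[i]))
                     for i in range(d)])
    return rows

# Toy "real" data (made up): one clinical variable and two peptide intensities.
rng = random.Random(0)
egfr = [rng.uniform(15, 120) for _ in range(500)]
peptide_a = [0.05 * e + rng.gauss(0, 1) for e in egfr]   # rises with eGFR
peptide_b = [-0.03 * e + rng.gauss(0, 1) for e in egfr]  # falls with eGFR

synthetic = sample_gaussian_copula([egfr, peptide_a, peptide_b], 2000, rng)
syn_egfr, syn_a, syn_b = (list(col) for col in zip(*synthetic))

rho_real_a, rho_syn_a = spearman(egfr, peptide_a), spearman(syn_egfr, syn_a)
rho_real_b, rho_syn_b = spearman(egfr, peptide_b), spearman(syn_egfr, syn_b)
print(f"peptide_a vs eGFR: real rho={rho_real_a:.3f}, synthetic rho={rho_syn_a:.3f}")
print(f"peptide_b vs eGFR: real rho={rho_real_b:.3f}, synthetic rho={rho_syn_b:.3f}")
```

Comparing the per-peptide rho-values with eGFR between the real and synthetic columns mirrors, on a very small scale, the kind of check the abstract reports. No synthetic row copies a real patient record, although interpolating between observed values is not in itself a privacy guarantee.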

Version published to 10.1101/2024.10.30.24316342 on medRxiv
Nov 4, 2024

Related articles

A novel pipeline for realistic synthetic longitudinal EHR data generation

This article has 3 authors:
1. Gabrielle Josling
2. Ibrahima Diouf
3. Sankalp Khanna
This article has no evaluations. Latest version: Jan 29, 2026

PRESSnet: a novel framework for patient stratification and biomarker discovery using clinical knowledge graphs

This article has 11 authors:
1. Jake Cohen-Setton
2. Shruti Shikhare
3. Ioannis Kagiampakis
4. Domingo Salazar
5. Miguel Goncalves
6. Elizabeth Coker
7. Sanddhya Jayabalan
8. Damian Bikiel
9. Ben Sidders
10. Etai Jacob
11. Krishna Bulusu
This article has no evaluations. Latest version: Dec 15, 2025

Cross-Platform Reproducible Modeling of Breast Cancer Prognosis Using the Core-PAM50 Gene Signature

This article has 2 authors:
1. Rafael de Negreiros Botan
2. Joao Batista de Sousa
This article has no evaluations. Latest version: Dec 19, 2025
