A FAIR/FHIR-aligned, reproducible multi-omics and epidemiology framework for privacy-preserving clinical research
Abstract
High-dimensional omics, longitudinal clinical data, and surveillance streams are now routine in translational research, yet reproducibility and governance often lag behind. We present a framework developed at Institut Pasteur that makes analyses auditable, portable, and privacy-preserving. It couples containerised Nextflow/Snakemake pipelines for bulk and single-cell RNA sequencing, proteomics, and metabolomics with rigorous quality control, ComBat batch harmonisation that preserves biological covariates, and a biostatistics layer (generalised linear and mixed-effects models, Cox survival models, hierarchical components) with explicit false discovery rate control. An epidemiology module computes vaccine effectiveness (VE), attack rates, the effective reproduction number Rt, and case-control odds ratios with robust intervals. Metadata follow the FAIR principles and map to HL7 FHIR resources; identifiers are pseudonymised under role-based access control. On synthetic exemplars shaped by real cohort constraints, we report the following outcomes: batch-attributable variance drops from 18.7% to 3.9% while differential expression signals remain stable; a LASSO gene-expression signature achieves an outer-fold ROC AUC of 0.81 ± 0.02 with calibration slope 0.98 and intercept 0.01; VE rises from 0.63 to 0.79 as Rt falls from 1.14 to 0.89; and covariate imbalance (median standardised mean difference, SMD) falls from 0.21 to 0.06 after matching. End-to-end runs reproduce byte-identically across laptops and HPC systems under Docker. The framework is intended as a conservative baseline for clinical bioinformatics and methods teaching, not as a medical device.
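As an illustration of the quantities the epidemiology module reports, the following is a minimal Python sketch, not the framework's own code. It assumes the textbook cohort definition VE = 1 - AR_vaccinated / AR_unvaccinated and a case-control odds ratio from a 2x2 table. The abstract does not specify how its "robust intervals" are built, so a plain Wald interval on the log odds ratio is swapped in here as a stand-in. All function names and counts are hypothetical.

```python
import math


def attack_rate(cases: int, population: int) -> float:
    """Proportion of a group that became a case during the study period."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population


def vaccine_effectiveness(cases_vax: int, n_vax: int,
                          cases_unvax: int, n_unvax: int) -> float:
    """Cohort-style VE = 1 - (attack rate vaccinated / attack rate unvaccinated)."""
    ar_vax = attack_rate(cases_vax, n_vax)
    ar_unvax = attack_rate(cases_unvax, n_unvax)
    if ar_unvax == 0:
        raise ValueError("VE is undefined when the unvaccinated attack rate is zero")
    return 1.0 - ar_vax / ar_unvax


def odds_ratio_wald(a: int, b: int, c: int, d: int, z: float = 1.959964):
    """Odds ratio with a Wald interval (assumed here, not the paper's method).

    a: exposed cases      b: unexposed cases
    c: exposed controls   d: unexposed controls
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the functions.
    ve = vaccine_effectiveness(cases_vax=30, n_vax=1000,
                               cases_unvax=100, n_unvax=1000)
    or_, lo, hi = odds_ratio_wald(a=20, b=10, c=10, d=20)
    print(f"VE = {ve:.2f}")
    print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

The Wald interval breaks down with sparse or zero cells, which is why the sketch rejects non-positive counts rather than silently applying a continuity correction. An exact or sandwich-based interval would be a more cautious choice for small studies.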