A novel pipeline for realistic synthetic longitudinal EHR data generation
Abstract
Background Synthetic health data offers a promising means of sharing clinical information without compromising patient privacy. However, existing methods often produce outputs that differ in structure from real data and are evaluated in narrow contexts, limiting their practical use in downstream analytical workflows. This study introduces a pipeline that builds upon existing methods for generating realistic synthetic longitudinal electronic health record data, evaluates it across three diverse datasets, and offers evidence-based guidance on the use of synthetic data to replace or augment real data.
Methods The pipeline extends the existing state-of-the-art HALO and ConSequence frameworks with a post-processing step that reconstructs continuous variables and timestamps, producing synthetic data that closely matches the structure of real medical record datasets. It was applied to three clinically diverse datasets: a small longitudinal cohort, a medium-sized intensive-care dataset, and a very large multi-hospital administrative dataset. Realism was assessed alongside utility for machine learning, statistical modelling, and time series analysis tasks.
Results Across all datasets, the pipeline generated realistic synthetic data that preserved key statistical properties and relationships. Machine learning models trained on synthetic data achieved similar predictive accuracy and feature importance patterns to those trained on real data, indicating strong utility. Synthetic data also performed well in statistical modelling, with the direction and magnitude of effects generally closely aligned with those in the real data. However, it may be less suitable when precise estimates are required or when modelling relatively rare conditions. Importantly, although the pipeline reconstructed timestamp structures, it did not capture aggregate temporal patterns, and the resulting data was therefore unsuitable for time series analysis.
Conclusions The pipeline produces realistic and analytically useful synthetic longitudinal electronic health record data across datasets of widely varying scales. These findings provide practical guidance on when synthetic data can meaningfully substitute for or complement real data.
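The abstract does not describe how the post-processing step reconstructs continuous variables and timestamps. As a rough illustration only, the sketch below assumes the generator emits discretised bin indices for a lab value and for the gap between visits, and that a continuous value is recovered by sampling uniformly within the chosen bin. The bin edges, the function names, and the uniform-sampling choice are all hypothetical and are not taken from the preprint, HALO, or ConSequence.

```python
import random
from datetime import datetime, timedelta

# Hypothetical bin edges for illustration; not taken from the preprint.
LAB_BIN_EDGES = [0.0, 50.0, 100.0, 200.0]           # a generic lab value
GAP_BIN_EDGES_DAYS = [0.0, 1.0, 7.0, 30.0, 365.0]   # days between visits


def sample_within_bin(bin_index, edges, rng):
    """Draw a continuous value uniformly between the two edges of a bin."""
    low, high = edges[bin_index], edges[bin_index + 1]
    return rng.uniform(low, high)


def reconstruct_visits(visits, start, rng):
    """Turn (gap_bin, lab_bin) index pairs into (timestamp, lab_value) pairs.

    Assumed scheme: each visit's timestamp is the previous timestamp plus
    a gap sampled from that visit's gap bin.
    """
    records = []
    current = start
    for gap_bin, lab_bin in visits:
        gap_days = sample_within_bin(gap_bin, GAP_BIN_EDGES_DAYS, rng)
        current = current + timedelta(days=gap_days)
        lab_value = sample_within_bin(lab_bin, LAB_BIN_EDGES, rng)
        records.append((current, lab_value))
    return records


rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic_visits = [(0, 1), (2, 2), (1, 0)]  # made-up generator output
records = reconstruct_visits(synthetic_visits, datetime(2020, 1, 1), rng)
for timestamp, value in records:
    print(timestamp.isoformat(timespec="minutes"), round(value, 1))
```

Under these assumptions each patient receives ordered, plausible-looking timestamps, but nothing in the sampling ties records to calendar-level trends such as seasonality. That is one possible way to see how timestamp structure could be reconstructed while aggregate temporal patterns are lost, as the Results report.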