Privacy-Enhancing Sequential Learning under Heterogeneous Selection Bias in Multi-Site EHR Data

Abstract

Objective

To develop privacy-enhancing statistical methods for estimating association parameters in binary disease risk models across multiple electronic health record (EHR) sites with heterogeneous selection mechanisms, without sharing raw individual-level data. We illustrate their utility through a cross-biobank analysis of smoking and 97 cancer subtypes using data from the NIH All of Us (AOU) and the Michigan Genomics Initiative (MGI).

Materials and Methods

Large-scale biobanks often follow heterogeneous recruitment strategies and store data in separate cloud-based platforms, making centralized algorithms infeasible. To address this, we propose two decentralized sequential estimators, namely Sequential Pseudo-likelihood (SPL) and Sequential Augmented Inverse Probability Weighting (SAIPW), that leverage external population-level information to adjust for selection bias and provide valid variance estimation. SAIPW additionally protects against misspecification of the selection model by using flexible machine-learning-based auxiliary outcome models. We compare SPL and SAIPW with the existing Sequential Unweighted (SUW) estimator and with centralized and meta-learning extensions of IPW and AIPW in simulations under both correctly specified and misspecified selection mechanisms. We apply the methods to harmonized data from MGI (n = 50,935) and AOU (n = 241,563) to estimate smoking-cancer associations.
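The general idea behind inverse probability weighting for selection bias can be illustrated with a minimal single-site sketch. This is not the SPL or SAIPW estimator and it is not sequential: it assumes the selection probabilities are known exactly, whereas in practice they would have to be estimated from external population-level information. The function names, the simulated selection model, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Solve sum_i w_i * x_i * (y_i - expit(x_i' beta)) = 0 by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        score = X.T @ (w * (y - p))
        # Weighted information matrix: X' diag(w * p * (1 - p)) X
        info = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def simulate_site(n, rng, beta_true=(-2.0, 0.7)):
    """Simulate a population, then keep only the selected individuals.

    Assumption for illustration: selection depends on exposure, disease,
    and their interaction, so the unweighted slope is biased.
    """
    x = rng.binomial(1, 0.3, size=n).astype(float)  # binary exposure
    d = rng.binomial(1, expit(beta_true[0] + beta_true[1] * x)).astype(float)
    pi = expit(-2.0 + 1.0 * d + 0.3 * x + 1.0 * d * x)  # P(selected | d, x)
    s = rng.binomial(1, pi).astype(bool)
    return x[s], d[s], pi[s]


rng = np.random.default_rng(0)
x, d, pi = simulate_site(200_000, rng)
design = np.column_stack([np.ones_like(x), x])

# Unweighted fit ignores the selection mechanism.
beta_unweighted = fit_weighted_logistic(design, d, np.ones_like(d))
# IPW fit weights each selected individual by 1 / P(selected).
beta_ipw = fit_weighted_logistic(design, d, 1.0 / pi)

print("true log odds ratio: 0.7")
print("unweighted estimate:", round(float(beta_unweighted[1]), 3))
print("IPW estimate:       ", round(float(beta_ipw[1]), 3))
```

In this toy setting the unweighted log odds ratio is pulled away from the true value because selection depends jointly on exposure and disease, while the weighted fit recovers it up to sampling error. Valid variance estimation for the weighted fit requires a sandwich-type estimator, which the sketch omits.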

Results

In simulations, SUW exhibited substantial bias and poor coverage. SPL and SAIPW yielded unbiased estimates with valid coverage probabilities under correct model specification, with SAIPW remaining robust under selection model misspecification. Both approaches showed no notable efficiency loss relative to centralized methods. Meta-learning methods were efficient for large sites but failed in settings with small cohort sizes and rare outcome prevalence. In real-data analysis, strong associations were consistently identified between smoking and cancers of the lung, bladder, and larynx, aligning with established epidemiological evidence.

Conclusion

Our framework enables valid, privacy-enhancing inference across EHR cohorts with heterogeneous selection, supporting scalable, decentralized research using real-world data.
