Privacy-Enhancing Sequential Learning under Heterogeneous Selection Bias in Multi-Site EHR Data

Ritoban Kundu
Xu Shi
Kumar Kshitij Patel
Lucila Ohno-Machado
Maxwell Salvatore
Peter X.K. Song
Bhramar Mukherjee


Abstract

Objective

To develop privacy-enhancing statistical methods for estimating association parameters in binary disease risk models across multiple electronic health record (EHR) sites with heterogeneous selection mechanisms, without sharing raw individual-level data. We illustrate their utility through a cross-biobank analysis of smoking and 97 cancer subtypes using data from the NIH All of Us (AOU) and the Michigan Genomics Initiative (MGI).

Materials and Methods

Large-scale biobanks often follow heterogeneous recruitment strategies and store data in separate cloud-based platforms, making centralized algorithms infeasible. To address this, we propose two decentralized sequential estimators, namely Sequential Pseudo-likelihood (SPL) and Sequential Augmented Inverse Probability Weighting (SAIPW), that leverage external population-level information to adjust for selection bias, with valid variance estimation. SAIPW additionally protects against misspecification of the selection model by using flexible machine-learning-based auxiliary outcome models. We compare SPL and SAIPW with the existing Sequential Unweighted (SUW) estimator and with centralized and meta-learning extensions of IPW and AIPW in simulations under both correctly specified and misspecified selection mechanisms. We apply the methods to harmonized data from MGI (n = 50,935) and AOU (n = 241,563) to estimate smoking-cancer associations.
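To make the general idea concrete, the following is a minimal sketch of inverse-probability-weighted logistic regression fitted across sites that exchange only aggregate summaries. It is not the authors' SPL or SAIPW estimator: it assumes each site's selection probabilities are known (the paper instead adjusts for selection using external population-level information), it omits the augmentation step and variance estimation, and it uses repeated Newton passes over the sites. All function names and the simulated selection mechanisms are hypothetical.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_site(rng, n, select_prob, beta_true):
    """Draw a population of size n, then keep a biased sample (hypothetical setup)."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])      # intercept + one exposure
    y = rng.binomial(1, expit(X @ beta_true))
    pi = select_prob(x, y)                    # assumed-known selection probability
    keep = rng.random(n) < pi
    return X[keep], y[keep], 1.0 / pi[keep]   # IPW weights = 1 / pi

def site_summaries(X, y, w, beta):
    """Weighted score and information; only these aggregates leave the site."""
    p = expit(X @ beta)
    grad = X.T @ (w * (y - p))
    hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
    return grad, hess

def sequential_ipw_logistic(sites, n_iter=25, tol=1e-8):
    """Visit sites in order, accumulating summaries, then take a Newton step."""
    d = sites[0][0].shape[1]
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad, hess = np.zeros(d), np.zeros((d, d))
        for X, y, w in sites:                 # running sums passed site to site
            g, h = site_summaries(X, y, w, beta)
            grad += g
            hess += h
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    beta_true = np.array([-1.0, 0.5])
    # Two sites with different (heterogeneous) selection mechanisms.
    site_a = simulate_site(rng, 50_000, lambda x, y: expit(-1.0 + 1.0 * y), beta_true)
    site_b = simulate_site(rng, 50_000, lambda x, y: expit(-0.5 + 0.8 * x + 0.5 * y), beta_true)
    weighted = sequential_ipw_logistic([site_a, site_b])
    # Setting all weights to 1 ignores selection, in the spirit of an unweighted fit.
    unweighted = sequential_ipw_logistic(
        [(X, y, np.ones_like(w)) for X, y, w in (site_a, site_b)]
    )
    print("true:      ", beta_true)
    print("weighted:  ", weighted)
    print("unweighted:", unweighted)
```

Because the weighted score and information are sums over individuals, accumulating them site by site gives the same Newton updates as fitting the pooled data, which is why a decentralized fit of this kind need not lose efficiency relative to a centralized one.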

Results

In simulations, SUW exhibited substantial bias and poor coverage. SPL and SAIPW yielded unbiased estimates with valid coverage probabilities under correct model specification, and SAIPW remained robust under selection model misspecification. Neither approach showed notable efficiency loss relative to centralized methods. Meta-learning methods were efficient for large sites but failed in settings with small cohort sizes and low outcome prevalence. In the real-data analysis, strong associations were consistently identified between smoking and cancers of the lung, bladder, and larynx, aligning with established epidemiological evidence.

Conclusion

Our framework enables valid, privacy-enhancing inference across EHR cohorts with heterogeneous selection, supporting scalable, decentralized research using real-world data.

Version published to 10.1101/2025.09.26.25336642 on medRxiv
Sep 28, 2025

Bayesian Network Structure Learning from Incomplete Breast Cancer Data Using Structural Expectation–Maximization

This article has 3 authors:
1. Navaee Lavasani Monireh
2. Rezaeitabar Vahid
3. Khayamzadeh Maryam
This article has no evaluations. Latest version: Dec 10, 2025

Personalized Disease Risk Prediction from Multimodal Health Data Using Large Language Models

This article has 2 authors:
1. Hanieh Arjmand
2. Alexandre Tomberg
This article has no evaluations. Latest version: Jan 25, 2026

Generative AI-Based Imputation to Preserve Data Fidelity and Enhance Outcome Prediction: A Multi-Institutional Study in Cardiac Surgery

This article has 11 authors:
1. Negin Maddah
2. Amin Ramezani
3. Qingchu Jin
4. Jakob Wollborn
5. Akinobu Itoh
6. Jaime B. Rabb
7. Felistas Mazhude
8. Robert S. Kramer
9. Douglas B. Sawyer
10. Raimond L. Winslow
11. Farhad R. Nezami
This article has no evaluations. Latest version: Jan 23, 2026