Reducing Information and Selection Bias in EHR-Linked Biobanks via Genetics-Informed Multiple Imputation and Sample Weighting

Maxwell Salvatore
Ritoban Kundu
Jiacong Du
Christopher R Friese
Alison M Mondul
David Hanauer
Haidong Lu
Celeste Leigh Pearce
Bhramar Mukherjee

Abstract

Electronic health records (EHRs) are valuable for public health and clinical research but are prone to many sources of bias, including missing data and non-probability selection. Missing data in EHRs is complex due to potential non-recording, fragmentation, or clinically informative absences. This study explores whether polygenic risk score (PRS)-informed multiple imputation for missing traits, combined with sample weighting, can mitigate missing data and selection biases in estimating disease-exposure associations. Simulations were conducted for missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) conditions under different sampling mechanisms. PRS-informed multiple imputation showed generally lower bias, particularly when combined with sample weighting. For example, in biased samples of 10,000 with exposure and outcome MAR data, PRS-informed imputation had lower percent bias (3.8%) and better coverage rate (0.883) compared to PRS-uninformed (4.5%; 0.877) and complete case analyses (10.3%; 0.784) in covariate-adjusted, weighted, multiple imputation scenarios. In a case study using Michigan Genomics Initiative (n=50,026) data, PRS-informed imputation aligned more closely with a sample-weighted All of Us-derived benchmark than analyses ignoring missing data and selection bias. Researchers should consider leveraging genetic data and sample weighting to address biases from missing data and non-probability sampling in biobanks.
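The general workflow the abstract describes (impute a missing exposure using a PRS as an auxiliary predictor, fit a weighted model on each completed dataset, then pool) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a continuous outcome and linear models for simplicity (the study estimates disease-exposure associations), a bootstrap-based imputation step, and selection probabilities that are known rather than estimated. All function names, variable names, and parameter values here are hypothetical.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares with a sandwich (robust) covariance estimate."""
    XtW = X.T * w
    bread = np.linalg.inv(XtW @ X)
    beta = bread @ (XtW @ y)
    resid = y - X @ beta
    meat = (X.T * (w * resid) ** 2) @ X
    return beta, bread @ meat @ bread

def rubin_pool(estimates, variances):
    """Pool M estimates with Rubin's rules: returns (mean, total variance)."""
    estimates = np.asarray(estimates)
    m = len(estimates)
    within = np.mean(variances)
    between = estimates.var(ddof=1)
    return estimates.mean(), within + (1 + 1 / m) * between

def impute_once(x, predictors, rng):
    """Fit a linear imputation model on a bootstrap of observed rows,
    then fill missing x with predictions plus residual noise."""
    obs = ~np.isnan(x)
    idx = rng.choice(np.flatnonzero(obs), size=obs.sum(), replace=True)
    Z = np.column_stack([np.ones(len(x)), predictors])
    coef, *_ = np.linalg.lstsq(Z[idx], x[idx], rcond=None)
    sigma = np.std(x[idx] - Z[idx] @ coef, ddof=Z.shape[1])
    x_imp = x.copy()
    x_imp[~obs] = Z[~obs] @ coef + rng.normal(0, sigma, (~obs).sum())
    return x_imp

def mi_weighted_estimate(x, y, age, prs, w, m=20, use_prs=True, seed=0):
    """Multiple imputation of x, weighted outcome model, Rubin pooling."""
    rng = np.random.default_rng(seed)
    cols = [age, y] + ([prs] if use_prs else [])
    ests, variances = [], []
    for _ in range(m):
        x_imp = impute_once(x, np.column_stack(cols), rng)
        X = np.column_stack([np.ones(len(y)), x_imp, age])
        beta, cov = wls(X, y, w)
        ests.append(beta[1])          # coefficient on the exposure
        variances.append(cov[1, 1])
    return rubin_pool(ests, variances)

# --- Simulated example (all values are illustrative assumptions) ---
rng = np.random.default_rng(1)
n = 20000
prs = rng.normal(size=n)
age = rng.normal(size=n)
x_full = 0.5 * prs + 0.3 * age + rng.normal(size=n)   # exposure
y_full = 0.4 * x_full + 0.2 * age + rng.normal(size=n)  # true slope 0.4

# Non-probability selection that depends on age and the outcome.
p_sel = 1 / (1 + np.exp(-(-1 + 0.5 * age + 0.5 * y_full)))
sel = rng.random(n) < p_sel
x, y, a, g = x_full[sel].copy(), y_full[sel], age[sel], prs[sel]
w = 1 / p_sel[sel]  # inverse-probability-of-selection weights

# MAR missingness in the exposure, driven by the observed covariate.
p_miss = 1 / (1 + np.exp(-(-1 + a)))
x[rng.random(len(x)) < p_miss] = np.nan

est, total_var = mi_weighted_estimate(x, y, a, g, w)
print(f"pooled slope: {est:.3f}, SE: {np.sqrt(total_var):.3f}")
```

Setting `use_prs=False` gives a PRS-uninformed imputation for comparison. In practice the selection weights would have to be estimated (for example, against an external reference population), which this sketch does not attempt.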

Version published to 10.1101/2024.10.28.24316286 on medRxiv
Oct 29, 2024

Related articles

Evaluating Imputation Methods for Handling Missing Data in Complex Survey Designs: Evidence from the India DHS 2017–18

This article has 6 authors:
1. Mahfuzer Rohman
2. Md Sabbir Hossain
3. Md Fakrul Islam
4. Prosenjit Basak Arka
5. Md Rafi Hasan
6. Md Jamal Uddin
This article has no evaluations. Latest version: Jan 23, 2026

Missing Data in OHCA Registries: How Multiple Imputation Methods Affect Research Conclusions—Paper II

This article has 4 authors:
1. Stella Jinran Zhan
2. Seyed Ehsan Saffari
3. Marcus Eng Hock Ong
4. Fahad Javaid Siddiqui
This article has no evaluations. Latest version: Jan 16, 2026

An Advanced Entropy Approach for Minimizing False Discoveries in Imputation-Based Association Analyses

This article has 4 authors:
1. Zhihui Zhang
2. Dakai Zhu
3. Xiangjun Xiao
4. Christopher I. Amos
This article has no evaluations. Latest version: Dec 17, 2025
