Causal Inference via Electronic Health Records in the National Clinical Cohort Collaborative: Challenges and Solutions in Long COVID Research

Zachary Butzin-Dozier
Yunwen Ji
Lin-Chiun Wang
A. Jerrod Anzalone
Eric Hurwitz
Rena C. Patel
Mark J. van der Laan
John M. Colford
Alan E. Hubbard


Abstract

Observational analyses of electronic health record (EHR) data using databases such as the National Clinical Cohort Collaborative pose unique challenges for researchers seeking causal inferences, particularly when evaluating subjectively defined outcomes like Long COVID. We explore several challenges and describe potential solutions.

1. Lack of true negatives: Many diagnoses and conditions either have a positive indicator or a missing status, requiring investigators to carefully consider which patients are likely negative for a given condition.
2. Differential monitoring: EHR data include nonrandom missingness driven by patients engaging with the healthcare system at different rates, which is often related to both the exposure and the outcome of interest.
3. Bias: EHR data sources face many biases, but are particularly vulnerable to informative missingness, differential monitoring, and model misspecification.
4. Large sample size: High precision (i.e., narrow confidence intervals) paired with potential bias leads to a high risk of incorrectly rejecting the null hypothesis.
5. Defining index time: It is important that investigators deliberately define index time (i.e., t₀, baseline) to ensure that they adjust only for baseline confounders and do not adjust for (or condition on) factors that are affected by the exposure of interest (i.e., colliders or mediators).
6. Parameter selection: Investigators should only select parameters that are supported by the data distribution.

This manuscript provides an overview of these challenges and solutions, using both simulated data and real-world data, with the outcome of Long COVID as the running example.
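The large-sample-size challenge can be illustrated with a small calculation. The sketch below is not taken from the manuscript: it assumes a simple two-sided z-test on a difference in means, a true effect of exactly zero, and a hypothetical fixed bias of 0.02 standard deviations (for example, from residual confounding). The function name `false_rejection_rate` and all numeric values are illustrative assumptions. It shows how the same small bias that is almost harmless at n = 100 per arm makes rejection of a true null nearly certain at n = 1,000,000 per arm.

```python
from statistics import NormalDist


def false_rejection_rate(bias, n_per_arm, sd=1.0, alpha=0.05):
    """Probability that a two-sided z-test rejects a TRUE null hypothesis
    when the difference-in-means estimator carries a fixed bias.

    All arguments are hypothetical illustration values, not estimates
    from any real EHR analysis.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of a difference in means with equal arm sizes.
    se = sd * (2 / n_per_arm) ** 0.5
    # The test statistic is distributed N(bias / se, 1) under the true null.
    shift = bias / se
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


if __name__ == "__main__":
    # Same small bias, increasing sample size: precision grows, bias does not.
    for n in (100, 10_000, 1_000_000):
        print(n, round(false_rejection_rate(bias=0.02, n_per_arm=n), 3))
```

With zero bias the function returns the nominal level alpha regardless of sample size. Once any bias is present, the standard error shrinks with n while the bias stays put, so the false rejection rate climbs toward 1.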

Version published to 10.1101/2025.06.06.25329168 on medRxiv
Jun 8, 2025

Related articles

Missing Data in OHCA Registries: How Multiple Imputation Methods Affect Research Conclusions—Paper II

This article has 4 authors:
1. Stella Jinran Zhan
2. Seyed Ehsan Saffari
3. Marcus Eng Hock Ong
4. Fahad Javaid Siddiqui
This article has no evaluations. Latest version: Jan 16, 2026

Evaluating Imputation Methods for Handling Missing Data in Complex Survey Designs: Evidence from the India DHS 2017–18

This article has 6 authors:
1. Mahfuzer Rohman
2. Md Sabbir Hossain
3. Md Fakrul Islam
4. Prosenjit Basak Arka
5. Md Rafi Hasan
6. Md Jamal Uddin
This article has no evaluations. Latest version: Jan 23, 2026

A Hybrid Pharmacovigilance Method for National-Scale Comorbidity Discovery: Association Rules with FDA-Approved PRR/Chi-square and EBGM Validation.

This article has 1 author:
1. Kaossara Osseni
This article has no evaluations. Latest version: Dec 24, 2025
