The effect of population selection criteria on model estimates and data missingness in electronic health record studies


Abstract

Objective

Electronic health records (EHRs) can be used to identify individuals at risk of various diseases, but data availability limits the study of entire populations and may introduce bias. We compared samples defined by different types of hospital contact to assess the impact on missing data and model results.

Materials and Methods

Using Escherichia coli bloodstream infections as a case study, we used data from Oxfordshire, UK, containing individuals with hospital contact, including blood tests from primary care providers. We compared two approaches: an “inpatient sample” requiring previous/current inpatient contact, reducing missing risk factors but restricting the denominator, and a broader “healthcare contact sample”, defined by previous healthcare contact (including outpatient appointments, emergency department visits, community blood tests), maximising inclusion.

Results

The healthcare contact sample contained more missing data for key demographics and systematically missing data for potential risk factors, including diagnosis codes. Missing data were more common in controls than in cases (in the inpatient sample, 17–21% of controls were missing vital signs versus 6–7% of cases) but varied little with lookback duration. Model estimates showed only small, insubstantial shifts between the samples. Compared to population estimates, younger males were under-represented, while those aged ≥75 years were largely captured.
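
The case-versus-control missingness comparison reported above can be illustrated with a minimal sketch. This is not the authors' code: the function name `missing_share_by_group`, the field names `is_case` and `heart_rate`, and the toy records are all hypothetical and chosen only to show how a per-group missingness proportion might be computed.

```python
from collections import defaultdict


def missing_share_by_group(records, field, group_key="is_case"):
    """Return {group: fraction of records in that group where `field` is None}.

    `records` is a list of dicts; a value of None (or an absent key)
    is treated as missing. All names here are illustrative assumptions.
    """
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Toy, made-up records (NOT study data): 4 cases and 4 controls.
records = [
    {"is_case": True, "heart_rate": 88},
    {"is_case": True, "heart_rate": None},
    {"is_case": True, "heart_rate": 92},
    {"is_case": True, "heart_rate": 101},
    {"is_case": False, "heart_rate": None},
    {"is_case": False, "heart_rate": 70},
    {"is_case": False, "heart_rate": None},
    {"is_case": False, "heart_rate": 64},
]

shares = missing_share_by_group(records, "heart_rate")
print(f"cases missing: {shares[True]:.0%}, controls missing: {shares[False]:.0%}")
```

Running the same function separately on each sample definition (for example, an inpatient-only subset and a broader healthcare contact subset) would give the kind of side-by-side comparison described in the text.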

Discussion

Including individuals with any previous healthcare contact resulted in more missing data versus restricting to inpatient-only contact. Despite this, there were only small shifts in model estimates. There were large but expected differences between EHR samples and population-level data.

Conclusion

Different sample selection approaches for EHR analyses could impact power and associations. Ideally, multiple approaches should be compared.
