Biases in Race and Ethnicity Introduced by Filtering Electronic Health Records for ‘Complete Data’

Jose M. Acitores Cortina
Yasaman Fatapour
Michael Zietz
Kathleen LaRow Brown
Undina Gisladottir
Danner Peter
Oliver John Bear Don’t Walk
Aditi Kuchi
Apoorva Srinivasan
Hongyu Liu
Jacob Berkowitz
Kevin Tsang
Nadine Friedrich
Sophia Kievelson
Nicholas P. Tatonetti


Abstract

Objective

Integrated clinical databases from national biobanks have advanced the capacity for disease research. Data quality and completeness filters are used when building clinical cohorts to address limitations of data missingness. However, these filters may unintentionally introduce systemic biases when they are correlated with race and ethnicity. In this study, we examined the race/ethnicity biases introduced by applying common filters to four clinical records databases.

Materials and Methods

We applied 19 filters commonly used in electronic health records research, covering the availability of demographics, medication records, visit details, observation periods, and other data types. We evaluated how applying each filter affected the representation of self-reported race and ethnicity groups. This assessment was performed across four databases comprising approximately 12 million patients.
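As a rough illustration of this kind of evaluation, the sketch below computes the fraction of patients in each self-reported group who remain after a single completeness filter is applied. It is not the authors' pipeline: the record layout, the field names `race_ethnicity` and `observation_days`, the group labels `A` and `B`, and the 365-day threshold are all hypothetical and chosen only to make the example runnable.

```python
from collections import Counter


def retention_by_group(patients, keep, group_key="race_ethnicity"):
    """Return {group: fraction of that group's patients retained by `keep`}."""
    # Count patients per group before and after applying the filter.
    before = Counter(p[group_key] for p in patients)
    after = Counter(p[group_key] for p in patients if keep(p))
    # Counter returns 0 for groups that were filtered out entirely.
    return {group: after[group] / total for group, total in before.items()}


# Hypothetical toy records; field names and values are illustrative only.
patients = [
    {"race_ethnicity": "A", "observation_days": 400},
    {"race_ethnicity": "A", "observation_days": 30},
    {"race_ethnicity": "A", "observation_days": 900},
    {"race_ethnicity": "A", "observation_days": 365},
    {"race_ethnicity": "B", "observation_days": 10},
    {"race_ethnicity": "B", "observation_days": 500},
    {"race_ethnicity": "B", "observation_days": 90},
    {"race_ethnicity": "B", "observation_days": 200},
]


def has_one_year_observation(patient):
    """Assumed example filter: at least 365 days of observation."""
    return patient["observation_days"] >= 365


rates = retention_by_group(patients, has_one_year_observation)
print(rates)  # {'A': 0.75, 'B': 0.25}
```

Comparing these per-group retention fractions, rather than only the overall cohort size, is what would reveal whether a filter removes one group at a higher rate than another.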

Results

Applying the observation period filter led to a substantial reduction in data availability across all races and ethnicities in all four datasets. However, data availability in the white group remained consistently higher than in the other racial groups examined after each filter was applied. Conversely, the Black/African American group was the most affected by each filter in three of the datasets: Cedars-Sinai, UK-Biobank, and Columbia University.

Discussion and Conclusion

Our findings underscore the importance of using only necessary filters, as they might disproportionately affect data availability for minoritized racial and ethnic populations. Researchers must consider these unintentional biases when performing data-driven research and explore techniques to minimize the impact of these filters, such as probabilistic methods or the use of machine learning and artificial intelligence.

Version published to 10.1101/2024.10.04.24314914 on medRxiv
Oct 7, 2024

Inequities in Healthcare-Associated Infections Across North America- A Systematic Review

This article has 1 author:
1. BDS MPH ScD(c) Chandni Shahdev
This article has no evaluations
Latest version Dec 30, 2025

Evaluating Imputation Methods for Handling Missing Data in Complex Survey Designs: Evidence from the India DHS 2017–18

This article has 6 authors:
1. Mahfuzer Rohman
2. Md Sabbir Hossain
3. Md Fakrul Islam
4. Prosenjit Basak Arka
5. Md Rafi Hasan
6. Md Jamal Uddin
This article has no evaluations
Latest version Jan 23, 2026

Retired doctors as users of patient-facing electronic health records: a mixed-methods survey in the UK and Spain

This article has 9 authors:
1. Ray B Jones
2. Angeles Lazcoz
3. Shang-Ming Zhou
4. Brian McMillan
5. Olanrewaju Bamidele
6. Richard Fitton
7. Brian Fisher
8. Mar Soler-Lopez
9. Charlotte Blease
This article has no evaluations
Latest version Feb 4, 2026
