D3MI: an efficient and powerful federated imputation method for bias reduction in the analysis of distributed incomplete data by accounting for within-site correlation and between-site heterogeneity

Abstract

Objective

Electronic health records (EHRs) collected from diverse healthcare institutions offer a rich and representative data source for clinical research. Federated learning enables analysis of these distributed data without sharing sensitive patient-level information, preserving privacy. However, missing data remain a major challenge and can introduce substantial bias if not properly addressed. Very few distributed imputation methods currently exist, and they fail to account for two critical aspects of EHR data: correlation within sites and variability across sites. We aim to fill this important methodological gap.

Methods

We propose Distributed Mixed Model-based Multiple Imputation (D3MI), a novel federated imputation method designed to reduce bias in the analysis of distributed EHRs. D3MI combines strengths of federated learning techniques, statistical learning methods for correlated data, and multilevel imputation algorithms to explicitly account for both within-site correlation and between-site heterogeneity using site-specific random effects. It preserves privacy by avoiding the sharing of raw data, and it is efficient in both communication and computation.
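
To make the role of site-specific random effects more concrete, the following is a minimal illustrative sketch, not the D3MI algorithm itself. It assumes a continuous incomplete outcome, a random-intercept linear model, and variance components that are already known. All function names (`site_summaries`, `pooled_beta`, `impute_site`) and the simulated values are hypothetical. Each site shares only aggregate statistics, and missing values are then drawn locally using a site-level intercept. For simplicity, the sketch estimates the fixed effects by pooled least squares and does not propagate parameter uncertainty, which a proper multiple imputation procedure would need to do.

```python
import numpy as np

def site_summaries(X, y):
    """Computed locally at one site; only these aggregates leave the site."""
    obs = ~np.isnan(y)
    Xo, yo = X[obs], y[obs]
    return Xo.T @ Xo, Xo.T @ yo, int(obs.sum())

def pooled_beta(summaries):
    """Combine per-site aggregates into a least-squares fixed-effect estimate."""
    XtX = sum(s[0] for s in summaries)
    Xty = sum(s[1] for s in summaries)
    return np.linalg.solve(XtX, Xty)

def impute_site(X, y, beta, sigma2_b, sigma2_e, rng):
    """Draw one imputed copy of y at a single site under a random-intercept model.

    Assumes y_ij = x_ij' beta + b_i + e_ij with b_i ~ N(0, sigma2_b) and
    e_ij ~ N(0, sigma2_e); the variance components are treated as known.
    """
    obs = ~np.isnan(y)
    n_obs = int(obs.sum())
    resid = y[obs] - X[obs] @ beta
    # Conditional distribution of the site intercept given the observed residuals.
    denom = n_obs * sigma2_b + sigma2_e
    b_mean = (n_obs * sigma2_b / denom) * resid.mean() if n_obs > 0 else 0.0
    b_var = sigma2_b * sigma2_e / denom
    b = rng.normal(b_mean, np.sqrt(b_var))
    # Fill in missing outcomes; observed values are left untouched.
    miss = ~obs
    y_imp = y.copy()
    noise = rng.normal(0.0, np.sqrt(sigma2_e), int(miss.sum()))
    y_imp[miss] = X[miss] @ beta + b + noise
    return y_imp

# Toy demonstration with three simulated sites (all values are made up).
rng = np.random.default_rng(0)
beta_true, sigma2_b, sigma2_e = np.array([1.0, 2.0]), 0.5, 1.0
sites = []
for n in (40, 60, 80):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    b_site = rng.normal(0.0, np.sqrt(sigma2_b))  # site-specific intercept
    y = X @ beta_true + b_site + rng.normal(0.0, np.sqrt(sigma2_e), n)
    y[rng.random(n) < 0.3] = np.nan  # roughly 30% missing completely at random
    sites.append((X, y))

beta_hat = pooled_beta([site_summaries(X, y) for X, y in sites])
imputed = [impute_site(X, y, beta_hat, sigma2_b, sigma2_e, rng) for X, y in sites]
```

Repeating the `impute_site` step several times would give multiple completed datasets, whose analysis results could then be combined with standard multiple-imputation pooling rules.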

Results

Through extensive simulation studies, we demonstrate that D3MI outperforms state-of-the-art distributed imputation methods in both accuracy and consistency. We further demonstrate the use of D3MI in a real-world EHR case study involving incomplete and clustered data from participating hospitals in the Georgia Coverdell Acute Stroke Registry.

Conclusion

By explicitly modeling the complex structure of distributed EHR data, D3MI addresses key limitations of existing approaches. It provides a powerful and efficient solution for handling missing data in distributed and privacy-sensitive settings and enhances the rigor and reproducibility of collaborative clinical research.
