Federated Learning of Electronic Health Records to Improve Mortality Prediction in Hospitalized Patients With COVID-19: Machine Learning Approach

Abstract

Machine learning models require large datasets that may be siloed across different health care institutions. Machine learning studies that focus on COVID-19 have been limited to single-hospital data, which limits model generalizability.

Objective

We aimed to use federated learning, a machine learning technique that avoids aggregating raw clinical data across multiple institutions, to predict mortality within 7 days in hospitalized patients with COVID-19.

Methods

Patient data were collected from the electronic health records of 5 hospitals within the Mount Sinai Health System. Logistic regression with L1 regularization/least absolute shrinkage and selection operator (LASSO) and multilayer perceptron (MLP) models were trained by using local data at each site. We developed a pooled model with combined data from all 5 sites, and a federated model that only shared parameters with a central aggregator.
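To make the parameter-sharing idea concrete, here is a minimal sketch of one way a federated L1-regularized logistic regression could be trained. It is not the authors' code. It assumes that the central aggregator takes a sample-size-weighted average of site parameters (in the style of federated averaging), that each site runs a few proximal-gradient steps per round, and that the data are synthetic. The function names (`local_update`, `federated_average`, `train_federated`, `make_synthetic_sites`) and all hyperparameters are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def local_update(w, X, y, lr=0.1, l1=0.01, epochs=5):
    """Run a few proximal-gradient steps of L1-penalized logistic regression on one site."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
        # Soft-thresholding is the proximal operator of the L1 penalty.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
    return w


def federated_average(site_weights, site_sizes):
    """Aggregator step (assumed): average parameters, weighted by site sample size."""
    sizes = np.asarray(site_sizes, dtype=float)
    return np.average(np.stack(site_weights), axis=0, weights=sizes)


def train_federated(sites, n_features, rounds=20):
    """Only parameter vectors leave each site; raw rows (X, y) stay local."""
    w = np.zeros(n_features)
    for _ in range(rounds):
        updates = [local_update(w, X, y) for X, y in sites]
        w = federated_average(updates, [len(y) for _, y in sites])
    return w


def make_synthetic_sites(n_sites=5, n_features=4, seed=0):
    """Synthetic stand-in for per-hospital data; not real patient records."""
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=n_features)
    sites = []
    for _ in range(n_sites):
        n = int(rng.integers(50, 200))
        X = rng.normal(size=(n, n_features))
        y = (rng.random(n) < sigmoid(X @ true_w)).astype(float)
        sites.append((X, y))
    return sites


if __name__ == "__main__":
    sites = make_synthetic_sites()
    print(train_federated(sites, n_features=4))
```

A pooled model, by contrast, would concatenate every site's `X` and `y` before fitting, which is exactly the data movement the federated setup avoids.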

Results

The LASSO_federated model outperformed the LASSO_local model at 3 hospitals, and the MLP_federated model performed better than the MLP_local model at all 5 hospitals, as determined by the area under the receiver operating characteristic curve. The LASSO_pooled model outperformed the LASSO_federated model at all hospitals, and the MLP_federated model outperformed the MLP_pooled model at 2 hospitals.
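The comparison metric here, the area under the receiver operating characteristic curve (AUROC), can be read as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. The sketch below computes it directly from that pairwise definition. The function name `auroc` and the toy inputs are illustrative assumptions, not values from the study.

```python
import numpy as np


def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (len(pos) * len(neg)))


if __name__ == "__main__":
    # Toy labels and scores; 3 of the 4 positive/negative pairs are ordered correctly.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise form uses memory proportional to the number of positives times negatives, so a rank-based implementation would be preferable for large cohorts.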

Conclusions

The federated learning of COVID-19 electronic health record data shows promise in developing robust predictive models without compromising patient privacy.

Article activity feed

  1. SciScore for 10.1101/2020.08.11.20172809:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: not detected.
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: not detected.

    Table 2: Resources

    No key resources detected.


    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    We note a few limitations of our study. First, data collection was limited to MSHS hospitals in NYC. This may limit model generalizability to hospitals in other regions. Also, this study focused on applying federated learning to predict outcomes based on patient EHR data in principle rather than creating an operational framework for immediate deployment. As such, there are various aspects of the federated learning process that this work does not address such as load balancing, convergence, and scaling. These models included only clinical data and could be enhanced by incorporating other modalities such as imaging or free-text. We only implemented two widely used classifiers within this framework, but other algorithms may perform better. Finally, identical MLP architectures were used across all learning strategies for direct comparisons but could have been further optimized. Future work will focus on accessibility and expanding analysis of federated models. We plan to release code written within common data model EHR formats to better facilitate scalability. We will study salient features of importance for federated models and analyze changes as data are added. Finally, we will integrate additional data types such as images to improve model performance. We aim to use this federated learning framework to predict other adverse outcomes in hospitalized COVID-19 patients such as acute kidney injury.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.