Clustering of patient comorbidities within electronic medical records enables high-precision COVID-19 mortality prediction

Abstract

We present an explainable AI framework to predict mortality after a positive COVID-19 diagnosis based solely on data routinely collected in electronic healthcare records (EHRs) obtained prior to diagnosis. We based our analysis on the UK Biobank, a cohort of half a million people, and linked NHS COVID-19 records. We developed a method to capture the complexity and large variety of clinical codes present in EHRs, and we show that these codes have a larger impact on risk than any other patient data except age. We use a form of clustering for natural language processing of the clinical codes, specifically topic modelling by Latent Dirichlet Allocation (LDA), to generate a succinct digital fingerprint of a patient’s full secondary care clinical history, i.e. their comorbidities and past interventions. These digital comorbidity fingerprints offer immediately interpretable and clinically meaningful descriptions, e.g. grouping cardiovascular disorders with common risk factors, but also novel groupings that are not obvious. The comorbidity fingerprints differ in both breadth and depth from existing observational disease associations in the COVID-19 literature. Taking this data-driven approach allows us to avoid human-induction bias and confirmation bias when selecting potentially important predictors of COVID-19 mortality. Alongside age, these digital fingerprints are the most important factors in our predictor. This holds the potential for improving individual risk profiling for clinical decisions and for identifying groups for public health interventions such as vaccine programmes. Combining our digital precondition fingerprints with demographic characteristics allows us to match or exceed the performance of existing state-of-the-art COVID-19 mortality predictors (EHCF), which have been developed through expert consensus. Our precondition fingerprinting and entire mortality prediction analytics pipeline are designed to be rapidly redeployable, e.g. for COVID-19 variants or other pre-existing diseases.
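
The fingerprinting step described in the abstract can be illustrated with a small sketch. The resources table further down reports that the authors implemented their model with the gensim library; the code below is not that implementation. It swaps in a tiny collapsed Gibbs sampler written in plain Python so that it runs without dependencies, and it assumes each patient is represented as a list of diagnosis codes. The function name `lda_gibbs`, the hyperparameters, the choice of two topics, and the four toy patient histories (written as ICD-10-style codes) are all hypothetical and chosen only for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, not the authors' code).

    docs: list of patients, each a list of clinical-code strings.
    Returns one topic mixture ("fingerprint") per patient.
    """
    rng = random.Random(seed)
    vocab_size = len({code for doc in docs for code in doc})

    # Count tables: patient-topic, topic-code, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_code = [dict() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every code occurrence.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = []
        for code in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_code[z][code] = topic_code[z].get(code, 0) + 1
            topic_total[z] += 1
        assign.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, code in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_code[z][code] -= 1
                topic_total[z] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_code[k].get(code, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_code[z][code] = topic_code[z].get(code, 0) + 1
                topic_total[z] += 1

    # Smoothed, normalised topic proportions per patient.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Made-up patient histories: two mostly cardiometabolic, two mostly respiratory.
patients = [
    ["I10", "I25", "E11", "I10"],
    ["I25", "E11", "I10"],
    ["J44", "J45", "J18"],
    ["J44", "J18", "J45", "J44"],
]
fingerprints = lda_gibbs(patients, num_topics=2)
for codes, fingerprint in zip(patients, fingerprints):
    print(codes, [round(p, 3) for p in fingerprint])
```

Each fingerprint is a short vector of topic proportions that sums to one, which is what would make it usable as a compact input feature alongside demographic data such as age. With real EHR data, the number of topics and the priors would need tuning, and a library implementation such as gensim's would be the practical choice.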

Article activity feed

  1. SciScore for 10.1101/2021.03.29.21254579

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    This model was implemented using the gensim library in the Python programming language [53]. | Python (suggested: IPython, RRID:SCR_001658)

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    There are a few limitations to our study. Firstly, our data set, while vast, may reflect the inherent bias of the UK Biobank [39], which has been discussed in detail elsewhere [39–41]; notably, the demographics reflect a “healthy volunteer” bias, with individuals being generally older, from more educated and less deprived socioeconomic backgrounds, and with significant under-representation of ethnic minorities compared to the UK population. Secondly, testing, treatments, and outcomes of COVID-19 continuously improved during the study period, possibly confounding the results. Moreover, because COVID-19 testing kits were in limited supply, priority was initially given to those considered at higher clinical risk, potentially leading to an overestimation of severe outcomes in the database. However, these COVID-19-specific data biases affect only the mortality prediction, not the structure of the DCFs. This study further relied on retrospective secondary care EHRs, and the model is therefore currently blind to conditions managed entirely in primary care; such conditions include many less severe cases of diabetes, asthma and hypertension. At the time of the study, data from primary care was not available. Future work will be needed to incorporate data from General Practices into the development of the DCFs. Balancing the strengths and limitations, we consider our derivation population to be relevant for the initial exploration of COVI...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding.