Automatic identification of risk factors for SARS-CoV-2 positivity and severe clinical outcomes of COVID-19 using Data Mining and Natural Language Processing

Abstract

Objectives

Several risk factors have been identified for severe clinical outcomes of COVID-19 caused by SARS-CoV-2. Some can be found in the structured data of patients’ Electronic Health Records. Others are recorded as unstructured free text and thus cannot be easily detected automatically. We propose automated real-time detection of risk factors using a combination of data mining and Natural Language Processing (NLP).

Material and methods

Patients were categorized as negative or positive for SARS-CoV-2, and according to disease severity (severe or non-severe COVID-19). Comorbidities were identified in the unstructured free-text using NLP. Further risk factors were taken from the structured data.
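
The abstract does not describe the NLP pipeline in detail, but the limitations quoted further down mention a "key terms list". As a minimal sketch of what key-term detection in free text could look like, the Python below matches a small dictionary of terms against a note. The term list, the names `KEY_TERMS` and `detect_comorbidities`, and the English wording are all assumptions made for illustration; they are not the study's actual terms or method.

```python
import re

# Hypothetical key terms for illustration only; the study's real list
# is not given in this text.
KEY_TERMS = {
    "diabetes": ["diabetes", "diabetic"],
    "copd": ["copd", "chronic obstructive pulmonary disease"],
    "hypertension": ["arterial hypertension", "hypertension"],
}


def _compile(terms):
    """Build one case-insensitive, whole-word pattern for a list of terms."""
    # Longer terms first so multi-word phrases win over their substrings.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


PATTERNS = {name: _compile(terms) for name, terms in KEY_TERMS.items()}


def detect_comorbidities(text):
    """Return the set of comorbidity labels whose key terms occur in text.

    This sketch does no negation handling, so a phrase such as
    "no diabetes" would still be flagged.
    """
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


note = "Known COPD and type 2 diabetes."
found = detect_comorbidities(note)
```

Plain term matching like this is easy to audit and to extend, which fits the later remark that disease stages could be covered by refining the key terms list; a production system would also need negation and context handling.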

Results

A total of 6250 patients were analysed (5664 negative and 586 positive; of the positive patients, 461 had non-severe and 125 had severe COVID-19). Using NLP, comorbidities, i.e. cardiovascular and pulmonary conditions, diabetes, dementia and cancer, were automatically detected (error rate ≤2%). Old age, male sex, higher BMI, arterial hypertension, chronic heart failure, coronary heart disease, COPD, diabetes, insulin-only treatment of diabetic patients, and reduced kidney and liver function were risk factors for severe COVID-19. Interestingly, the proportion of diabetic patients using metformin but not insulin was significantly higher in the non-severe COVID-19 cohort (p<0.05).
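
The cohort counts reported above can be checked for internal consistency and turned into proportions. The short Python sketch below uses only the four counts stated in the Results; the variable names are illustrative, and the derived percentages are simple arithmetic rather than figures reported by the authors.

```python
# Cohort counts as stated in the Results paragraph.
negative = 5664
positive = 586
non_severe = 461
severe = 125

total = negative + positive

# Internal consistency: the subgroups should add up to their parent groups.
counts_are_consistent = (total == 6250) and (non_severe + severe == positive)

# Share of tested patients who were SARS-CoV-2 positive.
positivity_rate = positive / total

# Share of positive patients classified as severe COVID-19.
severe_share = severe / positive

print(f"consistent: {counts_are_consistent}")
print(f"positive among tested: {positivity_rate:.1%}")
print(f"severe among positive: {severe_share:.1%}")
```

Under these counts, roughly 9% of the analysed patients tested positive and about a fifth of those had severe disease.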

Discussion and conclusion

Our findings were in line with previously reported risk factors for severe COVID-19. NLP in combination with other data mining approaches appears to be a suitable tool for the automated real-time detection of risk factors, which can provide time-saving support for risk assessment and triage, especially in patients with long medical histories and multiple comorbidities.

Article activity feed

  1. SciScore for 10.1101/2021.03.25.21254314:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement
    IRB: The protocol was approved by the Cantonal Ethics Committee of Bern (Project-ID 2020-00973).
    Consent: We considered all individuals tested for SARS-CoV-2 at the IHG from February 1st through November 16th 2020, covering the ‘first wave’ and part of the ‘second wave’ of COVID-19 in the country, and who did not reject the IHG general research consent.
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: Consequently, no formal power calculations were performed a priori.
    Sex as a biological variable: not detected.

    Table 2: Resources

    No key resources detected.


    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Our study has several limitations. As the IHG is a major hospital centre in the region, patients admitted to the hospital for other reasons were also tested for SARS-CoV-2 if they displayed any symptoms indicative of COVID-19. The patients in the SARS-CoV-2 negative cohort probably have more health-related problems than the general population. This effect is further reinforced as we only analysed patients with available EHRs, and, due to the retrospective nature of the study, had no information on patients tested at the ambulant COVID-19 test centre without being admitted to the IHG as in- or out-patients. This was partially mitigated by including records from the three months preceding diagnosis. Therefore, the higher incidence of specific comorbidities in the SARS-CoV-2 negative cohort might represent a selection bias. Additionally, we did not perform a case-controlled study or adjust for confounding factors such as smoking or age. Furthermore, with the exception of diabetes and renal function, we did not differentiate between the different stages and severities of the diseases. This issue can be addressed by refining the key terms list in a subsequent study. The validated error rate for the NLP detection of the different diseases is around 2%. Due to this low error rate, the pronounced contrast between significant features, and the large amount of EHRs analysed, it is not expected to affect statistical conclusions. The main focus of the analysis were patients tested posit...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.