Automatic identification of risk factors for SARS-CoV-2 positivity and severe clinical outcomes of COVID-19 using Data Mining and Natural Language Processing
This article has been reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
Objectives
Several risk factors have been identified for severe clinical outcomes of COVID-19 caused by SARS-CoV-2. Some can be found in the structured data of patients’ Electronic Health Records. Others are recorded as unstructured free text and thus cannot easily be detected automatically. We propose automated real-time detection of risk factors using a combination of data mining and Natural Language Processing (NLP).
Material and methods
Patients were categorized as negative or positive for SARS-CoV-2, and according to disease severity (severe or non-severe COVID-19). Comorbidities were identified in the unstructured free-text using NLP. Further risk factors were taken from the structured data.
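As an illustration of how comorbidities might be picked out of free text, the following is a minimal sketch of key-term matching with a crude negation check. It is not the authors' pipeline, which is not described in this text; the term lists, the negation cues, and the function name `detect_comorbidities` are all assumptions made for the example.

```python
import re

# Hypothetical key-term lists; the study's actual terms are not given here.
KEY_TERMS = {
    "diabetes": ["diabetes", "diabetic"],
    "copd": ["copd", "chronic obstructive pulmonary disease"],
    "heart_failure": ["heart failure"],
}

# Assumed negation cues: a cue word followed by at most 30 non-period
# characters right before the matched term (i.e. within the same sentence).
NEGATION = re.compile(r"\b(no|not|without|denies)\b[^.]{0,30}$", re.IGNORECASE)


def detect_comorbidities(note: str) -> set[str]:
    """Return the labels whose key terms occur un-negated in a free-text note."""
    found = set()
    for label, terms in KEY_TERMS.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, note, re.IGNORECASE):
                preceding = note[:match.start()]
                if not NEGATION.search(preceding):
                    found.add(label)
    return found


note = "Known COPD. No history of diabetes. Chronic heart failure since 2015."
print(sorted(detect_comorbidities(note)))
```

A real system would need curated term lists, handling of abbreviations and spelling variants, and manual validation of the kind reflected in the error rate reported in the Results.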
Results
6250 patients were analysed (5664 negative and 586 positive; of the positive patients, 461 had non-severe and 125 had severe COVID-19). Using NLP, comorbidities, i.e. cardiovascular and pulmonary conditions, diabetes, dementia and cancer, were automatically detected (error rate ≤2%). Old age, male sex, higher BMI, arterial hypertension, chronic heart failure, coronary heart disease, COPD, diabetes, insulin-only treatment of diabetic patients, and reduced kidney and liver function were risk factors for severe COVID-19. Interestingly, the proportion of diabetic patients using metformin but not insulin was significantly higher in the non-severe COVID-19 cohort (p<0.05).
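The cohort sizes quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts from the abstract; the function name `cohort_summary` is made up for the example, and the derived percentages are not figures reported by the authors.

```python
def cohort_summary(negative: int, non_severe: int, severe: int) -> dict:
    """Derive totals and proportions from the three basic cohort counts."""
    positive = non_severe + severe
    total = negative + positive
    return {
        "total": total,
        "positive": positive,
        "positivity_rate": positive / total,   # share of tested patients who were positive
        "severe_fraction": severe / positive,  # share of positive patients with severe disease
    }


# Counts taken from the abstract.
summary = cohort_summary(negative=5664, non_severe=461, severe=125)
print(summary["total"], summary["positive"])
print(f"{summary['positivity_rate']:.1%} positive, "
      f"{summary['severe_fraction']:.1%} of positives severe")
```

The counts are internally consistent: 461 + 125 = 586 positive patients and 5664 + 586 = 6250 patients overall.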
Discussion and conclusion
Our findings were in line with previously reported risk factors for severe COVID-19. NLP in combination with other data mining approaches appears to be a suitable tool for the automated real-time detection of risk factors, which can provide time-saving support for risk assessment and triage, especially in patients with long medical histories and multiple comorbidities.
Article activity feed
SciScore for 10.1101/2021.03.25.21254314: (What is this?)
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
Institutional Review Board Statement
IRB: The protocol was approved by the Cantonal Ethics Committee of Bern (Project-ID 2020-00973).
Consent: We considered all individuals tested for SARS-CoV-2 at the IHG between February 1st through November 16th 2020, covering the ‘first wave’ and part of the ‘second wave’ of COVID-19 in the country, and who did not reject the IHG general research consent.
Randomization: not detected.
Blinding: not detected.
Power Analysis: Consequently, no formal power calculations were performed a priori.
Sex as a biological variable: not detected.
Table 2: Resources
No key resources detected.
Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).
Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study: Our study has several limitations. As the IHG is a major hospital centre in the region, patients admitted to the hospital for other reasons were also tested for SARS-CoV-2 if they displayed any symptoms indicative of COVID-19. The patients in the SARS-CoV-2 negative cohort probably have more health-related problems than the general population. This effect is further corroborated as we only analysed patients with available EHRs, and, due to the retrospective nature of the study, had no information on patients tested at the ambulant COVID-19 test centre without being admitted to the IHG as in- or out-patients. This was partially mitigated by including records from the three months preceding diagnosis. Therefore the higher incidence of specific comorbidities in the SARS-CoV-2 negative cohort might represent a selection bias. Additionally, we did not perform a case-controlled study or adjust for confounding factors such as smoking or age. Furthermore, with the exception of diabetes and renal function, we did not differentiate between the different stages and severities of the diseases. This issue can be addressed by refining the key terms list in a subsequent study. The validated error rate for the NLP detection of the different diseases is around 2%. Due to this low error rate, the pronounced contrast between significant features, and the large number of EHRs analysed, it is not expected to affect statistical conclusions. The main focus of the analysis were patients tested posit...
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- No protocol registration statement was detected.