Using Primary Care Text Data and Natural Language Processing to Monitor COVID-19 in Toronto, Canada

Abstract

Objective

To investigate whether a rule-based natural language processing (NLP) system, applied to primary care clinical text data, can be used to monitor COVID-19 viral activity in Toronto, Canada.

Design

We employ a retrospective cohort design. We include primary care patients with a clinical encounter between January 1, 2020 and December 31, 2020 at one of 44 participating clinical sites.

Setting and Context

The study setting is Toronto, Canada. During the study timeframe, the city experienced a first wave of COVID-19 in spring 2020, followed by a second viral resurgence beginning in fall 2020.

Methods and Data

Study objectives are descriptive. We use an expert-derived dictionary, pattern-matching tools and a contextual analyzer to classify documents as 1) COVID-19 positive, 2) COVID-19 negative, or 3) unknown COVID-19 status. We apply the COVID-19 biosurveillance system across three primary care electronic medical record text streams: 1) lab text, 2) health condition diagnosis text and 3) clinical notes. We enumerate COVID-19 entities in the clinical text and estimate the proportion of patients with a positive COVID-19 record. We construct a primary care COVID-19 NLP-derived time series and investigate its correlation with external public health series: 1) lab-confirmed COVID-19 cases, 2) COVID-19 hospitalizations, 3) COVID-19 ICU admissions, and 4) COVID-19 intubations.
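
The three-way document classification described above can be illustrated with a minimal Python sketch. It is not the study's biosurveillance system: the term list, the cue lists, and the function name `classify_document` are hypothetical stand-ins for the expert-derived dictionary and contextual analyzer, chosen only to show the general shape of a rule-based approach.

```python
import re

# Hypothetical mini-dictionary of COVID-19 terms (an assumption for
# illustration; the study's expert-derived dictionary is not reproduced here).
COVID_TERMS = re.compile(r"\b(covid(?:-?19)?|sars-cov-2|coronavirus)\b", re.IGNORECASE)

# Hypothetical context cues. Negative cues are checked first so that
# "not detected" is not mistaken for the positive cue "detected".
NEGATIVE_CUES = re.compile(
    r"\b(negative|not detected|no evidence of|ruled out|denies)\b", re.IGNORECASE
)
POSITIVE_CUES = re.compile(
    r"\b(positive|confirmed|detected|diagnosed with)\b", re.IGNORECASE
)


def classify_document(text):
    """Return 'positive', 'negative', or 'unknown' for one clinical document."""
    saw_negative = False
    # Crude sentence split on end punctuation or line breaks.
    for sentence in re.split(r"(?<=[.!?])\s+|\n+", text):
        if not COVID_TERMS.search(sentence):
            continue  # sentence does not mention COVID-19 at all
        if NEGATIVE_CUES.search(sentence):
            saw_negative = True
        elif POSITIVE_CUES.search(sentence):
            return "positive"  # any positive mention marks the document positive
    return "negative" if saw_negative else "unknown"


if __name__ == "__main__":
    for note in [
        "COVID-19 swab positive on Nov 3.",
        "SARS-CoV-2 RNA not detected.",
        "Worried about coronavirus exposure.",
    ]:
        print(classify_document(note), "|", note)
```

In this sketch a single positive sentence decides the document, and a document that mentions COVID-19 without any cue falls into the unknown class. A production system would need a far richer handling of negation, hypotheticals, and family history.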

Results

Over the study timeframe, we identified 1,976 COVID-19 positive documents and 277 unique COVID-19 entities in the lab text; 539 COVID-19 positive documents and 121 unique COVID-19 entities in the health condition diagnosis text; and 4,018 COVID-19 positive documents and 644 unique COVID-19 entities in the clinical notes. A total of 196,440 unique patients were observed over the study timeframe, of whom 4,580 (2.3%) had at least one positive COVID-19 document in their primary care electronic medical record. We constructed an NLP-derived COVID-19 time series describing the temporal dynamics of COVID-19 positivity status over the study timeframe. The NLP-derived series correlates strongly with the external public health series under investigation.
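
As a rough illustration of how such a series might be built and compared, the sketch below aggregates positive-document dates into weekly counts and computes a Pearson correlation against a second series. The abstract does not state the aggregation interval or the correlation measure, so weekly binning and Pearson's r are assumptions, as are the helper names `weekly_counts` and `pearson`. The numbers in the example are made up and are not study data.

```python
from collections import Counter
from datetime import date
from math import sqrt


def weekly_counts(dates):
    """Count documents per ISO (year, week); weekly binning is an assumption."""
    return Counter(tuple(d.isocalendar())[:2] for d in dates)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Made-up dates of NLP-positive documents (illustrative only).
    positive_dates = [date(2020, 3, 2), date(2020, 3, 4), date(2020, 3, 9)]
    print(weekly_counts(positive_dates))

    # Made-up weekly counts for an NLP series and an external series.
    nlp_series = [2, 5, 9, 4]
    external_series = [20, 50, 90, 40]
    print(round(pearson(nlp_series, external_series), 3))
```

A correlation of this kind only summarizes co-movement at matching time points. Lagged comparisons (for example, against hospitalizations that trail case counts) would need the series shifted before calling `pearson`.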

Conclusions

Using a rule-based NLP system, we identified hundreds of unique COVID-19 entities and thousands of COVID-19 positive documents across millions of clinical text documents. Future work should continue to investigate how high-quality, low-cost, passively collected primary care electronic medical record clinical text data can be used for COVID-19 monitoring and surveillance.

Article activity feed

  1. SciScore for 10.1101/2022.04.27.22274400:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Ethics: IRB: Using descriptive time series plots we visually compare how our NLP derived primary care COVID-19 positivity series correlates with other known COVID-19 time series externally derived from Toronto Public Health, including: 1) lab confirmed COVID-19 cases, 2) COVID-19 hospitalizations, 3) COVID-19 ICU admissions, and 4) COVID-19 intubations [Toronto Public Health, 2022]. 2.5 Ethics: This study received ethics approval from North York General Hospital Research Ethics Board (REB ID: NYGH 20-0014).
    Sex as a biological variable: not detected.
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: not detected.
    Cell Line Authentication: Authentication: The tool was originally developed and validated at the United States Veteran’s Affairs health system [Chapman et al., 2020].

    Table 2: Resources

    No key resources detected.


    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Our study is not without limitations. A chief limitation of our study is that we employ a rule based COVID-19 phenotype algorithm which has undergone minimal internal validation in our local setting (see Appendix A). A challenge with internal validation of COVID-19 phenotyping algorithms relates to sampling design. As illustrated in Appendix A, evaluation of algorithm operating characteristics based on a random sample of clinical notes results in the identification of only a small number of true positive COVID-19 documents (and similarly a small number of algorithm predicted positive documents); hence, internal estimates of sensitivity and positive predictive value are imprecise. Estimation of algorithm operating characteristics from large samples are associated with increased costs (financial costs, time costs and human-resource costs), however, are necessary to achieve precise estimates of sensitivity and positive predictive value. Alternatively, sequential or stratified sampling designs may provide more efficient estimates of algorithm operating characteristics as compared to simple random sampling designs. That said, the original authors did perform their own validation of the COVID-19 biosurveillance tool and report exceptional sensitivity (94.2%) and positive predictive value (82.4%) [Chapman et al., 2020]. Qualitatively, our small validation study suggests similarly effective performance in an entirely different study setting/context. A central objective of our study i...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.