A generalizable data assembly algorithm for infectious disease outbreaks
This article has been reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
During infectious disease outbreaks, health agencies often share text-based information about cases and deaths. This information is rarely machine-readable, thus creating challenges for outbreak researchers. Here, we introduce a generalizable data assembly algorithm that automatically curates text-based, outbreak-related information and demonstrate its performance across 3 outbreaks. After developing an algorithm with regular expressions, we automatically curated data from health agencies via 3 information sources: formal reports, email newsletters, and Twitter. A validation data set was also curated manually for each outbreak, and an implementation process was presented for application to future outbreaks. When compared against the validation data sets, the overall cumulative missingness and misidentification of the algorithmically curated data were ≤2% and ≤1%, respectively, for all 3 outbreaks. Within the context of outbreak research, our work successfully addresses the need for generalizable tools that can transform text-based information into machine-readable data across varied information sources and infectious diseases.
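The regular-expression approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the trigger phrases, the number pattern, the function names `extract_counts` and `compare`, and the sample sentence are all hypothetical, and the definitions of missingness and misidentification used here are assumptions, since the text above does not define them.

```python
import re

# Hypothetical trigger phrases mapped to output field names. The phrases and
# patterns actually used by the authors are not given in this text.
TRIGGERS = {
    "cases": r"(?:confirmed\s+)?cases",
    "deaths": r"deaths",
}

# A count such as "1,250" or "45" that directly precedes a trigger phrase.
NUMBER = r"(\d{1,3}(?:,\d{3})+|\d+)"


def extract_counts(text):
    """Return {field: int} for every trigger phrase found in the text."""
    counts = {}
    for field, phrase in TRIGGERS.items():
        pattern = NUMBER + r"\s+" + phrase + r"\b"
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            # Strip thousands separators before converting to an integer.
            counts[field] = int(match.group(1).replace(",", ""))
    return counts


def compare(extracted, validation):
    """Compare extracted values with a manually curated validation record.

    Assumed definitions (not taken from the paper):
      missingness       = share of validation fields that were not extracted
      misidentification = share of validation fields extracted with a
                          different value
    """
    total = len(validation)
    missing = sum(1 for key in validation if key not in extracted)
    wrong = sum(
        1
        for key, value in validation.items()
        if key in extracted and extracted[key] != value
    )
    return missing / total, wrong / total


# Invented example sentence, not taken from any real report.
sample = (
    "Since the start of the outbreak, 1,250 confirmed cases "
    "and 45 deaths have been reported."
)
extracted = extract_counts(sample)
missingness, misidentification = compare(
    extracted, {"cases": 1250, "deaths": 45}
)
```

Keeping the trigger phrases in a single mapping means that adapting the sketch to a new outbreak or information source would mostly involve editing that mapping rather than the extraction logic.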
Article activity feed
SciScore for 10.1101/2021.04.21.21255862:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
- Ethics: Field Sample Permit: Similar aggregate statistics were also collected for the Ebola outbreak in the DRC from email newsletters issued by the Ministère de la Santé RDC (MSRDC) from August 6, 2018 (date of first newsletter received) to July 31, 2019 (date of last newsletter received) [11,12].
- Sex as a biological variable: not detected.
- Randomization: not detected.
- Blinding: not detected.
- Power Analysis: not detected.
Table 2: Resources
Software and Algorithms
- Sentence: The assembly algorithm was developed in the Python programming language and, as shown in Figure 1, uses regular expressions and trigger phrases to automatically transform semi-structured text-based information into machine-readable data.
- Resource: Python, suggested: (IPython, RRID:SCR_001658)
Results from OddPub: Thank you for sharing your code and data.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.