A generalizable data assembly algorithm for infectious disease outbreaks

Abstract

During infectious disease outbreaks, health agencies often share text-based information about cases and deaths. This information is rarely machine-readable, thus creating challenges for outbreak researchers. Here, we introduce a generalizable data assembly algorithm that automatically curates text-based, outbreak-related information and demonstrate its performance across 3 outbreaks. After developing an algorithm with regular expressions, we automatically curated data from health agencies via 3 information sources: formal reports, email newsletters, and Twitter. A validation data set was also curated manually for each outbreak, and an implementation process was presented for application to future outbreaks. When compared against the validation data sets, the overall cumulative missingness and misidentification of the algorithmically curated data were ≤2% and ≤1%, respectively, for all 3 outbreaks. Within the context of outbreak research, our work successfully addresses the need for generalizable tools that can transform text-based information into machine-readable data across varied information sources and infectious diseases.
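The approach the abstract describes, regular expressions keyed to trigger phrases that pull counts out of agency text, can be illustrated with a minimal sketch. This is not the authors' implementation: the trigger phrases, the `extract_counts` helper, and the sample sentence (including its numbers) are all hypothetical and chosen only to show the idea.

```python
import re

# Hypothetical trigger phrases (an assumption for illustration; the phrases
# used in the actual algorithm are not listed in this text).
TRIGGERS = {
    "cases": r"(?:confirmed\s+)?cases",
    "deaths": r"deaths",
}

# A count written either with thousands separators (1,234) or as plain digits.
NUMBER = r"(\d{1,3}(?:,\d{3})+|\d+)"


def extract_counts(text):
    """Return {field: int} for each trigger phrase directly preceded by a number."""
    counts = {}
    for field, phrase in TRIGGERS.items():
        pattern = NUMBER + r"\s+" + phrase + r"\b"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            # Drop thousands separators before converting to an integer.
            counts[field] = int(match.group(1).replace(",", ""))
    return counts


# Made-up sentence in the style of an agency report; the numbers are not real data.
sample = "Since the start of the outbreak, 1,234 confirmed cases and 56 deaths have been reported."
record = extract_counts(sample)
print(record)
```

Keeping the trigger phrases in a dictionary separate from the matching logic is one way to adapt such a sketch to a new outbreak or information source: only the phrases would change, not the extraction code.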

SciScore for 10.1101/2021.04.21.21255862

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

Ethics	Field Sample Permit: Similar aggregate statistics were also collected for the Ebola outbreak in the DRC from email newsletters issued by the Ministère de la Santé RDC (MSRDC) from August 6, 2018 (date of first newsletter received) to July 31, 2019 (date of last newsletter received) [11,12].
Sex as a biological variable	not detected.
Randomization	not detected.
Blinding	not detected.
Power Analysis	not detected.

Table 2: Resources

Software and Algorithms
Sentences	Resources
The assembly algorithm was developed in the Python programming language and, as shown in Figure 1, uses regular expressions and trigger phrases to automatically transform semi-structured text-based information into machine-readable data.	Python suggested: (IPython, RRID:SCR_001658)

Results from OddPub: Thank you for sharing your code and data.

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.

Related articles

EpidBot: A Natural Language Platform for Generalized Epidemic Intelligence

A large language model-assisted workflow for generating a living evidence base for climate-sensitive foodborne disease

Data processing pipelines and tools for routine health facility malaria surveillance in Uganda