Using multiple sampling strategies to estimate SARS-CoV-2 epidemiological parameters from genomic sequencing data

Abstract

The choice of viral sequences used in genetic and epidemiological analysis is important as it can induce biases that detract from the value of these rich datasets. This raises questions about how a set of sequences should be chosen for analysis. We provide insights on these largely understudied problems using SARS-CoV-2 genomic sequences from Hong Kong, China, and the Amazonas State, Brazil. We consider multiple sampling schemes, which were used to estimate R_t and r_t as well as the related R₀ and date of origin parameters. We find that both R_t and r_t are sensitive to changes in sampling, whilst R₀ and the date of origin are relatively robust. Moreover, we find that analysis using unsampled datasets results in the most biased R_t and r_t estimates for both our Hong Kong and Amazonas case studies. We highlight that sampling strategy choices may be an influential yet neglected component of sequencing analysis pipelines.
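
To make the idea of a temporal sampling scheme concrete, the following is a minimal sketch of two ways a sequence set could be thinned by collection week. It is illustrative only: the abstract does not specify which schemes were applied, and the function names, the weekly binning, and the `fraction` parameter are all assumptions introduced here.

```python
import random
from collections import defaultdict
from datetime import date


def iso_week(d):
    """Return (ISO year, ISO week) for a collection date."""
    iso = d.isocalendar()
    return (iso[0], iso[1])


def group_by_week(records):
    """Group sequence IDs by the ISO week of their collection date."""
    by_week = defaultdict(list)
    for seq_id, collected in records:
        by_week[iso_week(collected)].append(seq_id)
    return by_week


def uniform_weekly_sample(records, per_week, seed=0):
    """Hypothetical scheme: keep at most `per_week` sequences per week."""
    rng = random.Random(seed)
    by_week = group_by_week(records)
    chosen = []
    for week in sorted(by_week):
        ids = by_week[week]
        chosen.extend(rng.sample(ids, min(per_week, len(ids))))
    return chosen


def proportional_weekly_sample(records, cases_per_week, fraction, seed=0):
    """Hypothetical scheme: keep sequences in proportion to weekly cases.

    `cases_per_week` maps (ISO year, ISO week) to a reported case count;
    weeks without a case count contribute no sequences.
    """
    rng = random.Random(seed)
    by_week = group_by_week(records)
    chosen = []
    for week in sorted(by_week):
        ids = by_week[week]
        target = round(fraction * cases_per_week.get(week, 0))
        chosen.extend(rng.sample(ids, min(target, len(ids))))
    return chosen


# Toy input: ten sequences in one week, three in the next.
records = [("a%d" % i, date(2020, 3, 2)) for i in range(10)]
records += [("b%d" % i, date(2020, 3, 9)) for i in range(3)]
uniform = uniform_weekly_sample(records, per_week=5)
proportional = proportional_weekly_sample(
    records, {(2020, 10): 100, (2020, 11): 20}, fraction=0.05
)
```

An unsampled analysis corresponds to passing every record through unchanged, so comparing it against outputs like `uniform` and `proportional` is one way to see how the choice of scheme changes the input to downstream estimation.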

SciScore for 10.1101/2022.02.04.22270165:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
Using the Accession ID of each sequence, all sequences were screened and only sequences previously analysed and published in PubMed, MedRxiv, BioRxiv, virological or Preprint repositories were selected for subsequent analysis.	PubMed suggested: (PubMed, RRID:SCR_004846) BioRxiv suggested: (bioRxiv, RRID:SCR_003933)
The gradient of the slopes (clock rates) provided by TempEst were used to inform the clock prior in the phylodynamic analysis.	TempEst suggested: (TempEst, RRID:SCR_017304)
Bayesian Evolutionary Analysis: Date molecular clock phylogenies were inferred for all sampling strategies applied to the Amazonas and Hong Kong dataset using BEAST v1.10.4 (Suchard et al., 2018) with BEAGLE library v3.1.0 (Ayres et al., 2019) for accelerated likelihood evaluation.	BEAGLE suggested: (BEAGLE, RRID:SCR_001789)
Subsequently, 10% of all trees were discarded as burn in, and the effective sample size of parameter estimates were evaluated using TRACER v1.7.2 (Rambaut et al., 2018).	TRACER suggested: (Tracer, RRID:SCR_019121)
Phylodynamic Reconstruction: Estimation of the Reproduction Number and Time-varying Effective Reproduction Number The Bayesian birth-death skyline (BDSKY) model (Stadler et al., 2013) implemented within BEAST 2 v2.6.5 (Bouckaert et al., 2019) was used to estimate time-varying rates of epidemic transmission, measured as changes in Rt (Table 2).	BEAST suggested: (BEAST, RRID:SCR_010228)
The four independent MCMC runs were combined using LogCombiner v2.6.5. (Bouckaert et al., 2019) and the effective sample size of parameter estimates were evaluated using TRACER v1.7.2 (Rambaut et al., 2018).	LogCombiner suggested: (BEAST2, RRID:SCR_017307)
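
The post-processing steps quoted in the table (discarding 10% of each run as burn-in, combining independent runs, and checking the effective sample size) can be sketched as below. This is a simplified illustration, not the procedure implemented in LogCombiner or TRACER: the function names are hypothetical, and the ESS estimator uses a basic autocorrelation sum truncated at the first non-positive lag, which is an assumption chosen for brevity.

```python
import random


def discard_burn_in(samples, fraction=0.10):
    """Drop the first `fraction` of a chain's samples as burn-in."""
    start = int(len(samples) * fraction)
    return samples[start:]


def combine_chains(chains, fraction=0.10):
    """Remove burn-in from each chain, then concatenate the remainders."""
    combined = []
    for chain in chains:
        combined.extend(discard_burn_in(chain, fraction))
    return combined


def effective_sample_size(samples):
    """Estimate ESS as n / (1 + 2 * sum of lagged autocorrelations).

    The sum stops at the first non-positive autocorrelation, a simple
    truncation rule assumed here for illustration.
    """
    n = len(samples)
    mean = sum(samples) / n
    centred = [x - mean for x in samples]
    var = sum(c * c for c in centred) / n
    if var == 0:
        return float(n)  # constant chain: no autocorrelation to estimate
    rho_sum = 0.0
    for lag in range(1, n):
        acov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)


def ar1_chain(n, phi, seed):
    """Generate a toy autocorrelated chain (AR(1) process) for testing."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


# Four toy runs standing in for independent MCMC chains.
chains = [ar1_chain(500, 0.9, seed) for seed in range(4)]
combined = combine_chains(chains)
# Summing per-chain ESS avoids computing autocorrelation across the
# artificial boundaries created by concatenation.
total_ess = sum(effective_sample_size(discard_burn_in(c)) for c in chains)
```

Strongly autocorrelated chains give an ESS far below the raw sample count, which is why the ESS, rather than the number of logged states, is the quantity checked after combining runs.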

Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

While our results provide a rigorous underpinning and insight into the dynamics of SARS-CoV-2 and the impact of sampling strategies in the Amazonas region and Hong Kong, there are limitations. The Skygrowth and BDSKY models do not explicitly consider imports into their respective regions. This is particularly relevant for Hong Kong as most initial sequences from the region were sequenced from importation events (Adam et al., 2020) which can introduce error into parameter estimation. However, as the epidemic expanded, more infections were attributable to autochthonous transmission (Adam et al., 2020), and the risk of error introduced by importation events decreased. Moreover, while sampling strategies can account for temporal variations in genomic sampling fractions there is currently no way to account for non-random sampling approaches in either the BDSKY or Skygrowth models (Vasylyeva et al., 2020). It is unclear how network-based sampling may affect parameter estimates obtained through these models (Volz, Koelle and Bedford, 2013) presenting a key challenge in molecular and genetic epidemiology. Spatial heterogeneities were also not explored within this work. This represents the next key step in understanding the impact of sampling as spatial sampling schemes would allow the reconstruction of the dispersal dynamics and estimation of epidemic overdispersion (k), a key epidemiological parameter. This work has highlighted the impact and importance that applying temporal sampli...

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.

Related articles

Genomic characterization of SARS-CoV-2 variants circulating in the population of Bangui, Central African Republic (CAR) in 2022.

Estimating the effect of self-protection on transmission dynamics of SARS-CoV-2 in Germany in 2021: A modelling study

Overview of SARS-CoV-2 Genomic Surveillance in Central America and the Dominican Republic from February 2020 to January 2023: The Impact of PAHO and COMISCA's Collaborative Efforts