Cov2clusters: genomic clustering of SARS-CoV-2 sequences

Abstract

Background

The COVID-19 pandemic remains a global public health concern. Advances in sequencing technologies have allowed large volumes of SARS-CoV-2 whole genome sequence (WGS) data to be generated and rapidly shared through global repositories, enabling almost real-time genomic analysis of the pathogen. WGS data have previously been used to group genetically similar viral pathogens and reveal evidence of transmission, including with methods that identify distinct clusters on a phylogenetic tree. Identifying clusters of linked cases can aid regional surveillance and management of the disease. In this study, we present cov2clusters, a novel method for producing stable genomic clusters of SARS-CoV-2 cases, and compare its accuracy and stability with previous phylogenetic clustering methods using real-world SARS-CoV-2 sequence data obtained from British Columbia, Canada.

Results

We found that cov2clusters produced more stable clusters than previously used phylogenetic clustering methods when adding sequence data through time, mimicking an increase in sequence data through the pandemic. Our method also showed high accuracy when predicting epidemiologically informed clusters from sequence data.
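One simple way to quantify the kind of stability described above is to check how many sample pairs that shared a cluster at an earlier time point still share one after more sequences are added. The sketch below is only an illustration under that assumption: the function names, the sample labels, and the pair-retention measure itself are hypothetical and are not taken from the study.

```python
from itertools import combinations


def co_clustered_pairs(assignment):
    """Return the set of unordered sample pairs that share a cluster label."""
    by_cluster = {}
    for sample, label in assignment.items():
        by_cluster.setdefault(label, []).append(sample)
    pairs = set()
    for members in by_cluster.values():
        # Sort so each pair has a canonical (a, b) ordering.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pair_retention(earlier, later):
    """Fraction of pairs co-clustered earlier that remain co-clustered later."""
    before = co_clustered_pairs(earlier)
    if not before:
        return 1.0  # nothing to lose, so treat as fully stable
    after = co_clustered_pairs(later)
    return len(before & after) / len(before)


# Hypothetical cluster assignments at two time points (illustrative only).
earlier = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B"}
later = {"s1": "A", "s2": "A", "s3": "C", "s4": "B", "s5": "B", "s6": "C"}

retention = pair_retention(earlier, later)
print(retention)
```

A value of 1.0 would mean that no previously linked pair was split when new data arrived; lower values indicate that existing clusters were broken up.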

Conclusions

Our new approach allows for the identification of stable clusters of SARS-CoV-2 from WGS data. Producing high-resolution SARS-CoV-2 clusters from sequence data alone can be a challenge and, where possible, genomic and epidemiological data should be used in combination.

SciScore for 10.1101/2022.03.10.22272213:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

Ethics	not detected.
Sex as a biological variable	not detected.
Randomization	Sampling strategies included random sampling (ranging from 5-100% of cases at different periods) and targeted sampling (outbreaks and targeted populations such as travellers) [25].
Blinding	not detected.
Power Analysis	not detected.

Table 2: Resources

Software and Algorithms
Sentences	Resources
Consensus sequences were generated using the Connor Laboratory pipeline (https://github.com/connor-lab/ncov2019-artic-nf) with consensus bases called at a frequency of 0.75 with a subsampling read count strategy.	Connor Laboratory suggested: None
Consensus sequences were aligned and trimmed to Wuhan-Hu-1 reference sequence (Accession MN908947, Version MN908947.3) using MAFFT (v7.471) [27] prior to phylogenetic tree production.	MAFFT suggested: (MAFFT, RRID:SCR_011811)
Phylogenetic analyses: A multiple sequence alignment of the full SARS-CoV-2 genome was used to construct maximum-likelihood (M-L) phylogenetic trees with IQ-TREE (v.2.1.3) [28].	IQ-TREE suggested: (IQ-TREE, RRID:SCR_017254)
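
The first row of Table 2 mentions consensus bases being called at a frequency of 0.75. As a rough illustration of what such a threshold means, the sketch below calls the majority base at a position only when it reaches that fraction of the reads and otherwise emits an ambiguous N. It is a simplified assumption for explanation only: the function names and the toy read columns are hypothetical, and this is not the logic of the ncov2019-artic-nf pipeline.

```python
from collections import Counter


def call_consensus_base(bases, min_freq=0.75, ambiguous="N"):
    """Call the majority base if its frequency reaches min_freq, else ambiguous."""
    if not bases:
        return ambiguous  # no read coverage at this position
    base, count = Counter(bases).most_common(1)[0]
    return base if count / len(bases) >= min_freq else ambiguous


def call_consensus(pileup_columns, min_freq=0.75):
    """Build a consensus string from per-position collections of read bases."""
    return "".join(call_consensus_base(col, min_freq) for col in pileup_columns)


# Hypothetical read bases at four positions (illustrative only).
columns = ["AAAA", "CCCT", "GGTT", ""]
consensus = call_consensus(columns)
print(consensus)
```

In this toy input the third position has no base at or above 0.75 and the fourth has no coverage, so both are reported as N.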

Results from OddPub: Thank you for sharing your code.

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

One limitation of our study is that we do not have exposure, contact or location information to explore this application. Sequences belonging to a P.1 sublineage (P.1.14) form a single, large cluster (illustrated as the red cluster in the delta wave dataset in Figure 2), coinciding with a high number of low-diversity P.1 cases present in BC from April 2021 onwards [22], where almost all P.1 samples were within 0-1 SNPs of another P.1 sequence. This phenomenon is also expected with the recent Omicron variant, where rapid spread has led to high numbers of low-diversity cases [23]. Increasing the probability threshold to 0.9 (or conducting phylogenetic clustering with a smaller maximum clade divergence threshold) breaks up the cluster into smaller groups of identical or near-identical sequences, but this does not reflect genuine underlying clustering (Supplementary Figure S2). In such circumstances, we recommend including additional metadata to refine clusters into genetically related groups with shared demography and epidemiology. Alternatively, our approach could be used as a surveillance tool focusing on particular individuals or settings of interest, identifying sequences that are linked to the focal individuals or exposure sites and moving outwards to a desired number of “rings”. While COVID-19 remains at pandemic levels with high case numbers in many regions globally, it is anticipated that there will be a shift to endemicity characterized by persistent, lower levels of the dis...
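
The effect of raising the probability threshold can be illustrated with a toy example. The sketch below assumes that each pair of sequences already has a link probability and that clusters are the connected groups of pairs at or above the threshold; both the linkage rule and the probability values are assumptions made for illustration, not details confirmed by the text above, and the function name is hypothetical.

```python
def cluster_by_threshold(samples, pair_prob, threshold):
    """Group samples into connected components of pairs with prob >= threshold."""
    parent = {s: s for s in samples}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), prob in pair_prob.items():
        if prob >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for s in samples:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pairwise link probabilities (illustrative only).
samples = ["s1", "s2", "s3", "s4"]
pair_prob = {
    ("s1", "s2"): 0.95,
    ("s2", "s3"): 0.85,
    ("s3", "s4"): 0.92,
    ("s1", "s4"): 0.40,
}

loose = cluster_by_threshold(samples, pair_prob, 0.8)
strict = cluster_by_threshold(samples, pair_prob, 0.9)
print(loose)
print(strict)
```

With these made-up values, a threshold of 0.8 keeps all four samples in one cluster, while 0.9 drops the s2-s3 link and splits them into two pairs, which mirrors the fragmentation described above.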

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.
