Data-driven analysis of amino acid change dynamics timely reveals SARS-CoV-2 variant emergence

Abstract

Since its emergence in late 2019, the spread of SARS-CoV-2 has been accompanied by the evolution of its viral genome. The co-occurrence of specific amino acid changes, collectively named a ‘virus variant’, requires scrutiny, as variants may substantially affect the agent’s transmission, pathogenesis, or antigenicity; variant evolution is studied using phylogenetics. Yet this problem has never been tackled by mining the data with ad hoc analysis techniques. Here we show that the emergence of variants can in fact be traced through data-driven methods, further capitalizing on the value of large collections of SARS-CoV-2 sequences. For all countries with sufficient data, we compute weekly counts of amino acid changes, unveil time-varying clusters of changes with similar, rapidly growing dynamics, and then follow their evolution. Our method succeeds in associating clusters with variants of interest/concern in a timely manner, provided their change composition is well characterized. This allows us to detect variants’ emergence, rise, peak, and eventual decline under the competitive pressure of another variant. Our early warning system, relying exclusively on deposited sequences, shows the power of big data in this context and supports the call for widespread public SARS-CoV-2 genome sequencing for improved surveillance and control of the COVID-19 pandemic.

SciScore for 10.1101/2021.07.12.452076:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
Data extraction and aggregation was performed using PostgreSQL (Version 12.5) and the Pandas library (Version 1.2.1) of Python (Version 3.8.5).	Python suggested: (IPython, RRID:SCR_001658)
At each time t, we retain the n(t) time-series of change prevalence (current ratio between change counts and total counts of changes) that are observed continuously over a time interval of four weeks prior to t ([t − w + 1, …, t], with w = 5) and partition them via k-medoids clustering [48] (PAM algorithm, kmedoids function in MATLAB), with pairwise distances between time-series being evaluated via dynamic time warping [49] (dtw in MATLAB).	MATLAB suggested: (MATLAB, RRID:SCR_001622)
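
The clustering step quoted above can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code: it uses a plain dynamic time warping distance with absolute-difference cost and a simple alternating-update k-medoids rather than MATLAB's PAM implementation, and the prevalence values, the choice of k = 2, and all function names are assumptions made up for illustration.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def pairwise_dtw(series):
    """Symmetric matrix of DTW distances between all pairs of series."""
    k = len(series)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(series[i], series[j])
    return dist


def k_medoids(dist, k, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed distance matrix.

    Assumption: this is a simplified variant, not the full PAM swap search.
    """
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(max_iter):
        # Assign every series to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # New medoid: the member with the smallest total distance to the others.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels


# Hypothetical prevalence windows of w = 5 weeks for four amino acid changes:
# two roughly flat series and two rapidly growing ones (illustrative values only).
windows = np.array([
    [0.01, 0.01, 0.01, 0.01, 0.01],
    [0.02, 0.02, 0.02, 0.02, 0.02],
    [0.01, 0.05, 0.10, 0.20, 0.40],
    [0.01, 0.06, 0.12, 0.22, 0.42],
])
dist = pairwise_dtw(windows)
medoids, labels = k_medoids(dist, k=2)
print(labels)
```

On this toy input the two flat series fall into one cluster and the two growing series into the other; which cluster receives which label depends on the random initialization.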

Results from OddPub: Thank you for sharing your data.

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

We do not see this outcome as an intrinsic limitation of our approach, though. In fact, even when not selected as hits, warnings are informative of the dynamics underlying variant emergence and inter-variant competition; in addition, similarity analysis can clarify, typically within a few weeks at most, whether an emerging variant is gaining traction. Relying exclusively on big data for variant scouting is hampered by three problems: (i) consistency of sampling, (ii) delay of deposition, and (iii) biases in sampling. As for (i), the ratio between the number of sequences and reported COVID-19 cases varies widely among countries, dropping from 77% in Iceland (clearly facilitated by the small number of cases) to below 0.1% for many countries, including some large ones like India and Brazil. Among the countries with many cases, the UK stands out with an exceptionally high ratio exceeding 9%. Indeed, the US and the UK have contributed the largest number of sequences worldwide (517k and 425k, respectively). Extended Data Table 3 shows these statistics for all countries contributing to GISAID with more than 1,000 sequences, whereas Extended Data Table 4 shows statistics for the 50 US states. Regarding (ii), as an example, in the UK the average delay between collection and deposition amounts to 24 days. This delay tended to shrink as the pandemic unfolded, from 38 days in 2020 to just 16 days in 2021. Iceland again shows the best performance, with 11 days of average delay in 20...

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.
