Data-driven analysis of amino acid change dynamics timely reveals SARS-CoV-2 variant emergence

Abstract

Since its emergence in late 2019, the diffusion of SARS-CoV-2 has been associated with the evolution of its viral genome. The co-occurrence of specific amino acid changes, collectively named a ‘virus variant’, requires scrutiny, as variants may hugely impact the agent’s transmission, pathogenesis, or antigenicity; variant evolution is studied using phylogenetics. Yet this problem has never been tackled by digging into data with ad hoc analysis techniques. Here we show that the emergence of variants can in fact be traced through data-driven methods, further capitalizing on the value of large collections of SARS-CoV-2 sequences. For all countries with sufficient data, we compute weekly counts of amino acid changes, unveil time-varying clusters of changes with similar, rapidly growing dynamics, and then follow their evolution. Our method succeeds in associating clusters with variants of interest/concern in a timely manner, provided their change composition is well characterized. This allows us to detect variants’ emergence, rise, peak, and eventual decline under the competitive pressure of another variant. Our early warning system, relying exclusively on deposited sequences, shows the power of big data in this context and adds to the calls for widespread public SARS-CoV-2 genome sequencing to improve surveillance and control of the COVID-19 pandemic.

Article activity feed

  1. SciScore for 10.1101/2021.07.12.452076:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    Data extraction and aggregation was performed using PostgreSQL (Version 12.5) and the Pandas library (Version 1.2.1) of Python (Version 3.8.5).
    Python
    suggested: (IPython, RRID:SCR_001658)
    At each time t, we retain the n(t) time-series of change prevalence (current ratio between change counts and total counts of changes) that are observed continuously over a time interval of four weeks prior to t ([t − w + 1, …, t], with w = 5) and partition them via k-medoids clustering [48] (PAM algorithm, kmedoids function in MATLAB), with pairwise distances between time-series being evaluated via dynamic time warping [49] (dtw in MATLAB).
    MATLAB
    suggested: (MATLAB, RRID:SCR_001622)
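The clustering step quoted above (prevalence time-series over a window of w = 5 weeks, pairwise dynamic time warping distances, k-medoids partitioning) can be illustrated with a small pure-Python sketch. This is not the authors' MATLAB code. The change names, the counts, and the function names are hypothetical. "Observed continuously" is assumed here to mean a nonzero count in every week of the window. A brute-force medoid search stands in for the PAM algorithm and is only feasible for a handful of series.

```python
from itertools import combinations


def prevalence_window(change_counts, total_counts, t, w=5):
    """Prevalence over weeks [t-w+1 .. t]; None if any week has a zero count.

    Assumption: 'observed continuously' means a nonzero count in every week.
    """
    start = t - w + 1
    if start < 0 or any(change_counts[s] == 0 for s in range(start, t + 1)):
        return None
    return [change_counts[s] / total_counts[s] for s in range(start, t + 1)]


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def k_medoids_exhaustive(dist, k):
    """Pick the k medoids minimising total distance (brute force, not PAM)."""
    n = len(dist)
    best_cost, best_medoids = float("inf"), None
    for medoids in combinations(range(n), k):
        cost = sum(min(dist[i][m] for m in medoids) for i in range(n))
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # Label each series with the index of its nearest medoid.
    labels = [min(best_medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return list(best_medoids), labels


# Hypothetical weekly counts for five amino acid changes (toy data).
totals = [100, 100, 100, 100, 100]
changes = {
    "A": [1, 2, 4, 8, 16],       # rapidly growing
    "B": [2, 3, 5, 9, 17],       # rapidly growing
    "C": [50, 50, 50, 50, 50],   # flat
    "D": [48, 49, 50, 51, 50],   # flat
    "E": [0, 1, 2, 3, 4],        # missing in week 0, so it is dropped
}

t = 4
series = {}
for name, counts in changes.items():
    window = prevalence_window(counts, totals, t)
    if window is not None:
        series[name] = window

names = list(series)
dist = [[dtw_distance(series[p], series[q]) for q in names] for p in names]
medoids, labels = k_medoids_exhaustive(dist, k=2)
clusters = {names[m]: [names[i] for i, l in enumerate(labels) if l == m]
            for m in medoids}
print(clusters)
```

On this toy input the two growing changes end up in one cluster and the two flat ones in the other. The number of clusters k is fixed by hand here; how it is chosen in the actual study is not stated in the quoted sentence.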

    Results from OddPub: Thank you for sharing your data.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    We do not see this outcome as an intrinsic limitation of our approach, though. In fact, even when not selected as hits, warnings are informative of the dynamics underlying variant emergence and inter-variant competition; in addition, similarity analysis can clarify, typically in a matter of a few weeks at most, whether an emerging variant is gaining traction. Relying exclusively on big data for variant scouting is hampered by three problems: (i) consistency of sampling, (ii) delay of deposition, and (iii) biases in sampling. As for (i), the ratio between the number of sequences and reported COVID-19 cases widely varies among countries and drops from 77% in Iceland (clearly facilitated by small number of cases) to below 0.1% for many countries, also including some large ones like India and Brazil. Among the countries with lots of cases, the UK stands with an exceptionally high ratio exceeding 9%. Indeed, US and UK have contributed the largest number of sequences worldwide (517k and 425k, respectively). Extended Data Table 3 shows these statistics for all countries contributing to GISAID with more than 1,000 sequences, whereas Extended Data Table 4 shows statistics for the 50 US states. Regarding (ii), as an example, in the UK the average delay between collection and deposition amounts to 24 days. This delay tended to reduce as the pandemic unfolded, from 38 days in 2020 to just 16 days in 2021. Iceland is again striking the best performance, with 11 days of average delay in 20...
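The two sampling statistics discussed in the sentences above (the ratio of deposited sequences to reported cases, and the average delay between collection and deposition) are simple to compute from per-sequence metadata. The following is a minimal sketch under stated assumptions: the records, the counts, and the function names are hypothetical and are not taken from GISAID or from the paper.

```python
from datetime import date
from statistics import mean


def sequencing_ratio(n_sequences, n_cases):
    """Fraction of reported cases that have a deposited sequence."""
    return n_sequences / n_cases


def mean_deposition_delay(records):
    """Average number of days between collection and deposition."""
    return mean((deposited - collected).days for collected, deposited in records)


# Hypothetical (collection date, deposition date) pairs.
records = [
    (date(2021, 1, 4), date(2021, 1, 20)),   # 16 days
    (date(2021, 1, 10), date(2021, 2, 1)),   # 22 days
    (date(2021, 2, 1), date(2021, 2, 11)),   # 10 days
]

delay_days = mean_deposition_delay(records)
ratio = sequencing_ratio(n_sequences=90, n_cases=1000)  # hypothetical counts
print(f"mean delay: {delay_days} days, sequenced fraction: {ratio:.1%}")
```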

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.