Profiling SARS-CoV-2 mutation fingerprints that range from the viral pangenome to individual infection quasispecies

Abstract

Background

The genome of SARS-CoV-2 is susceptible to mutations during viral replication because of errors generated by RNA-dependent RNA polymerases. These mutations enable SARS-CoV-2 to evolve into new strains. Viral quasispecies emerge from de novo mutations that occur in individual patients. In combination, these sets of viral mutations provide distinct genetic fingerprints that reveal patterns of transmission and have utility in contact tracing.

Methods

Leveraging thousands of sequenced SARS-CoV-2 genomes, we performed a viral pangenome analysis to identify conserved genomic sequences. We used a rapid and highly efficient computational approach that relies on k-mers, short tracts of sequence, instead of conventional sequence alignment. Using this method, we annotated viral mutation signatures that were associated with specific strains. Based on these highly conserved viral sequences, we developed a rapid and highly scalable targeted sequencing assay to identify mutations, detect quasispecies variants, and identify mutation signatures from patients. These results were compared to the pangenome genetic fingerprints.
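As a rough illustration of the alignment-free idea described above, the following Python sketch counts how many genomes contain each k-mer and keeps the k-mers shared by all (or most) of them. It is not the authors' implementation; the function names, the `min_fraction` parameter, and the toy sequences are assumptions made for this example.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def conserved_kmers(genomes, k, min_fraction=1.0):
    """Return the k-mers present in at least min_fraction of the genomes.

    min_fraction is a hypothetical tuning parameter: 1.0 keeps only k-mers
    found in every genome.
    """
    counts = Counter()
    for genome in genomes:
        # set() ensures a k-mer is counted once per genome, not once per occurrence
        counts.update(set(kmers(genome, k)))
    threshold = min_fraction * len(genomes)
    return {kmer for kmer, n in counts.items() if n >= threshold}


# Toy sequences (not real viral genomes); the second differs by one base.
genomes = [
    "ACGTACGTTGCA",
    "ACGTACGATGCA",
    "ACGTACGTTGCA",
]
shared = conserved_kmers(genomes, k=5)
print(sorted(shared))
```

In this toy input, only the k-mers that do not overlap the substituted base survive in all three sequences, which is the property a conserved-region search relies on.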

Results

We built a k-mer index for thousands of SARS-CoV-2 genomes and identified conserved genomic regions and the landscape of mutations across these genomes. We delineated mutation profiles spanning common genetic fingerprints (the combination of mutations in a viral assembly) and combinations of mutations that appear in only a small number of patients. We developed a targeted sequencing assay by selecting primers from the conserved viral genome regions to flank frequent mutations. Using a cohort of 100 SARS-CoV-2 clinical samples, we identified genetic fingerprints consisting of strain-specific mutations seen across populations and de novo quasispecies mutations localized to individual infections. We compared the mutation profiles of viral samples undergoing analysis with the features of the pangenome.
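One way to picture a genetic fingerprint as "the combination of mutations in a viral assembly" is as an unordered set of mutation labels. The sketch below, under that assumption, groups assemblies by fingerprint and separates frequently seen fingerprints from rare ones. The assembly names, mutation labels, and the `max_rare` cutoff are illustrative toy inputs, not results from the study.

```python
from collections import Counter


def fingerprint_counts(assemblies):
    """Map each fingerprint (a frozenset of mutations) to the number of assemblies carrying it."""
    return Counter(frozenset(muts) for muts in assemblies.values())


def split_common_rare(counts, max_rare=1):
    """Split fingerprints into common and rare groups.

    max_rare is a hypothetical cutoff: fingerprints seen in at most
    max_rare assemblies are treated as rare.
    """
    common = {fp: n for fp, n in counts.items() if n > max_rare}
    rare = {fp: n for fp, n in counts.items() if n <= max_rare}
    return common, rare


# Toy input: assembly name -> list of mutation labels.
assemblies = {
    "asm1": ["C241T", "A23403G"],
    "asm2": ["A23403G", "C241T"],  # same set of mutations, different order
    "asm3": ["C241T", "A23403G"],
    "asm4": ["C8782T", "T28144C"],
}
counts = fingerprint_counts(assemblies)
common, rare = split_common_rare(counts)
print(len(common), len(rare))
```

Using a frozenset makes the fingerprint independent of the order in which mutations are listed, so two assemblies with the same mutations always fall into the same group.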

Conclusions

We conducted an analysis of viral mutation profiles that provide the basis of genetic fingerprints. Our study linked pangenome analysis with targeted deep sequencing of SARS-CoV-2 clinical samples. We identified quasispecies mutations occurring within individual patients and determined their general prevalence when compared to over 70,000 other strains. Analysis of these genetic fingerprints may provide a way of conducting molecular contact tracing.
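The comparison of a patient's mutations against the pangenome can be sketched as a simple prevalence lookup: for each mutation in a sample, compute the fraction of pangenome assemblies that carry it. This is only an assumed reading of "general prevalence"; the function name and all mutation labels below are hypothetical toy inputs ("G100A" in particular is made up).

```python
def mutation_prevalence(sample_mutations, pangenome_fingerprints):
    """Return, for each sample mutation, the fraction of pangenome assemblies carrying it."""
    total = len(pangenome_fingerprints)
    if total == 0:
        raise ValueError("pangenome_fingerprints must not be empty")
    return {
        mutation: sum(mutation in fp for fp in pangenome_fingerprints) / total
        for mutation in sample_mutations
    }


# Toy pangenome: one set of mutation labels per assembly.
pangenome = [
    {"C241T", "A23403G"},
    {"C241T", "A23403G", "C14408T"},
    {"C8782T"},
    {"C241T"},
]
sample = {"C241T", "A23403G", "G100A"}  # "G100A" is a made-up label
prevalence = mutation_prevalence(sample, pangenome)

# Mutations never seen in the pangenome are candidates for being private to the sample.
unseen = sorted(m for m, p in prevalence.items() if p == 0.0)
print(prevalence, unseen)
```

A mutation with zero prevalence is only a candidate for a de novo change; sequencing error and incomplete pangenome sampling would need to be ruled out separately.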

Article activity feed

  1. SciScore for 10.1101/2020.11.02.20224816:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: IRB: The Institutional Review Board at Stanford University School of Medicine approved the study protocol (IRB-56088).
    Randomization: Briefly, this library preparation consists of a two-step transposase-based process referred to as tagmentation, where sequencing adapters are randomly inserted into the PCR amplicons DNA by transposition.
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: not detected.

    Table 2: Resources

    Software and Algorithms

    Sentence: We also identified 447 human coronavirus genomes by selecting “host: human”, “complete genome”, and “date: up to 2019 Oct” from ViPR [26].
    Resource: ViPR
    suggested: (vipR, RRID:SCR_010685)

    Sentence: This tool is written in the Julia programming language, a high-performance, general programming language interoperable with Python, R, C, and Fortran; we also provide a command-line interface for language-agnostic use and to support use in pipelines with other software.
    Resource: Python
    suggested: (IPython, RRID:SCR_001658)

    Sentence: Bioinformatic sequencing analysis: Raw sequence data underwent base calling and demultiplexing using bcl2fastq (v2.20).
    Resource: bcl2fastq
    suggested: (bcl2fastq, RRID:SCR_015058)

    Sentence: Reads were aligned using bwa (mem algorithm; v0.7.17) and processed into bam files using samtools (v1.10).
    Resource: samtools
    suggested: (SAMTOOLS, RRID:SCR_002105)

    Sentence: Reads aligning to either the human or viral genome were counted by using the command “samtools idxstats,” and per-base coverage metrics were analyzed using bedtools (v2.29) using the “bedtools coverage -d” command, selecting only for the SARS-CoV-2 genome.
    Resource: bedtools
    suggested: (BEDTools, RRID:SCR_006646)
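The read-counting step quoted above can be sketched in Python as a parser for the tab-separated output of `samtools idxstats` (reference name, sequence length, mapped read count, unmapped read count). This is a minimal sketch, not the authors' script: it assumes reads were aligned to a combined human plus viral reference, that the viral contig is named `NC_045512.2`, and that every other contig is human. The function name and the example numbers are hypothetical.

```python
def count_reads_by_genome(idxstats_text, viral_contigs):
    """Sum mapped reads per genome from `samtools idxstats` text output.

    Assumption: any contig not listed in viral_contigs belongs to the human reference.
    """
    totals = {"viral": 0, "human": 0}
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            # The "*" row reports reads with no reference assigned; skip it.
            continue
        key = "viral" if name in viral_contigs else "human"
        totals[key] += int(mapped)
    return totals


# Made-up example output; real output has one row per reference contig.
example = (
    "chr1\t248956422\t1200\t3\n"
    "NC_045512.2\t29903\t8500\t10\n"
    "*\t0\t0\t42\n"
)
totals = count_reads_by_genome(example, viral_contigs={"NC_045512.2"})
print(totals)
```

In practice the text would come from running `samtools idxstats` on an indexed BAM file and capturing its standard output.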

    Results from OddPub: Thank you for sharing your data.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    However, one limitation of this deep sequencing approach is that the coverage is restricted to only approximately 40% of the viral genome. As a result, not all mutations will be detected. To overcome this issue, future iterations of assay design will expand the number of amplicons for broader coverage, while maintaining primers in conserved sequences. Thus, based on our pangenome analysis, there are opportunities to target mutation-prone sequences as they appear in the population. Our sequencing assay possesses important operational advantages compared to other molecular detection methods. Following sample processing, numerous individual specimens can be pooled during library preparation and maintain their unique identity. Identification of individual samples relies on assignments from DNA-based sample barcodes. Another advantage is the use of a library normalization procedure that eliminates the need for library balancing and simplifies the workflow for sequencing. For small genomes like SARS-CoV-2, one can sequence tens of thousands of samples in a single sequencing run, depending on the capacity of the sequencer. This scalability feature makes the analysis of large numbers of samples feasible compared to other assays which require that samples be maintained in individual wells. The operational scalability of NGS also enables one to conduct large-scale population screening with the potential for significant cost reduction compared to other methods.
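The coverage limitation mentioned above is a statement about breadth of coverage: the fraction of genome positions sequenced to some minimum depth. A hedged Python sketch follows, assuming per-base output in which the depth is the last tab-separated column of each line (the layout `bedtools coverage -d` is understood to produce). The `min_depth` threshold, the function name, and the ten-position toy input are assumptions for illustration only.

```python
def breadth_of_coverage(per_base_text, min_depth=1):
    """Return the fraction of positions whose depth is at least min_depth.

    Assumes one line per position with the depth in the last tab-separated column.
    """
    total = 0
    covered = 0
    for line in per_base_text.splitlines():
        if not line.strip():
            continue
        depth = int(line.split("\t")[-1])
        total += 1
        if depth >= min_depth:
            covered += 1
    if total == 0:
        raise ValueError("no positions found in input")
    return covered / total


# Toy per-base table for a 10-position interval: contig, start, end, position, depth.
depths = [0, 0, 5, 7, 9, 3, 0, 0, 0, 0]
per_base = "\n".join(
    f"NC_045512.2\t0\t10\t{pos}\t{depth}"
    for pos, depth in enumerate(depths, start=1)
)
breadth = breadth_of_coverage(per_base)
print(breadth)
```

Raising `min_depth` gives a stricter notion of coverage, which matters when low-frequency quasispecies variants need many supporting reads.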

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a protocol registration statement.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding.